Merge branch 'dev' into final-reads-saving

2024-12-22 05:08:17 +00:00 · 2023-04-20 14:18:10 +02:00 · 0245c4880b
commit 0245c4880b
parent 105bac7bb9 255b492b44
12 changed files with 141 additions and 133 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -11,6 +11,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### `Fixed`

+- [#271](https://github.com/nf-core/taxprofiler/pull/271/files) Improved standardised table generation documentation and mOTUs manual database download tutorial (♥ to @prototaxites for reporting, fix by @jfy133)
+- [#269](https://github.com/nf-core/taxprofiler/pull/269/files) Reduced output files in AWS full test output due to very large files
+- [#270](https://github.com/nf-core/taxprofiler/pull/270/files) Fixed warning for host removal index parameter, and improved index checks (♥ to @prototaxites for reporting, fix by @jfy133)
+- [#274](https://github.com/nf-core/taxprofiler/pull/274/files) Substituted the samtools/bam2fq module with samtools/fastq module (fix by @sofstam)
+
 ### `Dependencies`

 ### `Deprecated`
--- a/conf/modules.config
+++ b/conf/modules.config
@@ -374,13 +374,13 @@ process {
        ext.prefix = { "${meta.id}_${meta.run_accession}.unmapped" }
    }

-    withName: SAMTOOLS_BAM2FQ {
+    withName: SAMTOOLS_FASTQ {
        ext.prefix = { "${meta.id}_${meta.run_accession}.unmapped" }
        publishDir = [
            [
-                path: { "${params.outdir}/samtools/bam2fq" },
+                path: { "${params.outdir}/samtools/fastq" },
                mode: params.publish_dir_mode,
-                pattern: '*.fq.gz',
+                pattern: '*.fastq.gz',
                enabled: params.save_hostremoval_unmapped
            ],
            [
--- a/docs/output.md
+++ b/docs/output.md
@@ -21,7 +21,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [Bowtie2](#bowtie2) - Host removal for Illumina reads
 - [minimap2](#minimap2) - Host removal for Nanopore reads
 - [SAMtools stats](#samtools-stats) - Statistics from host removal
-- [SAMtools bam2fq](#samtools-bam2fq) - Converts unmapped BAM file to fastq format (minimap2 only)
+- [SAMtools fastq](#samtools-fastq) - Converts unmapped BAM file to fastq format (minimap2 only)
 - [Analysis Ready Reads](#analysis-read-reads) - Optional results directory containing the final processed reads used as input for classification/profiling.
 - [Bracken](#bracken) - Taxonomic classifier using k-mers and abundance estimations
 - [Kraken2](#kraken2) - Taxonomic classifier using exact k-mer matches
@@ -202,7 +202,7 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/

 By default nf-core/taxprofiler will only provide the `.log` file if host removal is turned on. You will only have a `.bam` file if you specify `--save_hostremoval_bam`. This will contain _both_ mapped and unmapped reads. You will only get FASTQ files if you specify to save `--save_hostremoval_unmapped` - these contain only unmapped reads. Alternatively, if you wish only to have the 'final' reads that go into classification/profiling (i.e., that may have additional processing), do not specify this flag but rather specify `--save_analysis_ready_reads`, in which case the reads will be in the folder `analysis_ready_reads`.

-> ℹ️ Unmapped reads in FASTQ are only found in this directory for short-reads, for long-reads see [`samtools/bam2fq/`](#samtools-bam2fq)
+> ℹ️ Unmapped reads in FASTQ are only found in this directory for short-reads, for long-reads see [`samtools/fastq/`](#samtools-fastq)

 > ⚠️ The resulting `.fastq` files may _not_ always be the 'final' reads that go into taxprofiling, if you also run other steps such as run merging etc.

@@ -229,11 +229,11 @@ By default, nf-core/taxprofiler will only provide the `.bam` file containing map

 > ℹ️ minimap2 is not yet supported as a module in MultiQC and therefore there is no dedicated section in the MultiQC HTML. Rather, alignment statistics to host genome is reported via samtools stats module in MultiQC report.

-> ℹ️ Unlike Bowtie2, minimap2 does not produce an unmapped FASTQ file by itself. See [`samtools/bam2fq`](#samtools-bam2fq)
+> ℹ️ Unlike Bowtie2, minimap2 does not produce an unmapped FASTQ file by itself. See [`samtools/fastq`](#samtools-fastq)

-### SAMtools bam2fq
+### SAMtools fastq

-[SAMtools bam2fq](http://www.htslib.org/doc/1.1/samtools.html) converts a `.sam`, `.bam`, or `.cram` alignment file to FASTQ format
+[SAMtools fastq](http://www.htslib.org/doc/1.1/samtools.html) converts a `.sam`, `.bam`, or `.cram` alignment file to FASTQ format

 <details markdown="1">
 <summary>Output files</summary>
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -296,7 +296,9 @@ nf-core/taxprofiler supports generation of Krona interactive pie chart plots for

 ##### Multi-Table Generation

-In addition to per-sample profiles, the pipeline also supports generation of 'native' multi-sample taxonomic profiles (i.e., those generated by the taxonomic profiling tools themselves or additional utility scripts provided by the tool authors).
+The main multi-sample table from nf-core/taxprofiler is produced by [Taxpasta](https://taxpasta.readthedocs.io/en/latest/), a dedicated standalone tool originally developed for the pipeline. When you provide `--run_profile_standardisation`, every classifier/profiler and database combination gets a standardised multi-sample taxon table in the [`taxpasta/`](https://nf-co.re/taxprofiler/output) directory. These tables all share the same structure, to facilitate comparison between the results of the different classifiers/profilers.
+
+In addition to per-sample profiles and standardised Taxpasta output, the pipeline also supports generation of 'native' multi-sample taxonomic profiles (i.e., those generated by the taxonomic profiling tools themselves or additional utility scripts provided by the tool authors), when providing `--run_profile_standardisation` to your pipeline.

 These are executed on a per-database level. I.e., you will get a multi-sample taxon table for each database you provide for each tool and will be placed in the same directory as the directories containing the per-sample profiles.

@@ -309,7 +311,7 @@ The following tools will produce multi-sample taxon tables:
 - **MetaPhlAn3** (via MetaPhlAn's `merge_metaphlan_tables.py` script)
 - **mOTUs** (via the `motus merge` command)

-Note that the multi-sample tables from these folders are not inter-operable with each other as they can have different formats.
+Note that the multi-sample tables from the 'native' tools in each folder are [not inter-operable](https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started/) with each other, as they can have different formats and can contain additional and different data. If you need comparable tables, we recommend using the standardised and merged output from Taxpasta, as described above.

 ### Updating the pipeline

@@ -794,6 +796,8 @@ More information on the MetaPhlAn3 database can be found [here](https://github.c

 mOTUs does not provide the ability to construct custom databases. Therefore we recommend using the prebuilt database of marker genes provided by the developers.

+> ⚠️ **Do not change the directory name of the resulting database if moving it to a central location.** The database name `db_mOTU/` is hardcoded in the mOTUs tool.
+
 To do this you need to have `mOTUs` installed on your machine.

 ```bash
--- a/modules.json
+++ b/modules.json
@@ -187,9 +187,9 @@
                        "git_sha": "c8e35eb2055c099720a75538d1b8adb3fb5a464c",
                        "installed_by": ["modules"]
                    },
-                    "samtools/bam2fq": {
+                    "samtools/fastq": {
                        "branch": "master",
-                        "git_sha": "c8e35eb2055c099720a75538d1b8adb3fb5a464c",
+                        "git_sha": "0f8a77ff00e65eaeebc509b8156eaa983192474b",
                        "installed_by": ["modules"]
                    },
                    "samtools/index": {
--- a/modules/nf-core/samtools/bam2fq/main.nf
+++ b/modules/nf-core/samtools/bam2fq/main.nf
@@ -1,56 +0,0 @@
-process SAMTOOLS_BAM2FQ {
-    tag "$meta.id"
-    label 'process_low'
-
-    conda "bioconda::samtools=1.16.1"
-    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
-        'https://depot.galaxyproject.org/singularity/samtools:1.16.1--h6899075_1' :
-        'quay.io/biocontainers/samtools:1.16.1--h6899075_1' }"
-
-    input:
-    tuple val(meta), path(inputbam)
-    val split
-
-    output:
-    tuple val(meta), path("*.fq.gz"), emit: reads
-    path "versions.yml"             , emit: versions
-
-    when:
-    task.ext.when == null || task.ext.when
-
-    script:
-    def args = task.ext.args ?: ''
-    def prefix = task.ext.prefix ?: "${meta.id}"
-
-    if (split){
-        """
-        samtools \\
-            bam2fq \\
-            $args \\
-            -@ $task.cpus \\
-            -1 ${prefix}_1.fq.gz \\
-            -2 ${prefix}_2.fq.gz \\
-            -0 ${prefix}_other.fq.gz \\
-            -s ${prefix}_singleton.fq.gz \\
-            $inputbam
-
-        cat <<-END_VERSIONS > versions.yml
-        "${task.process}":
-            samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
-        END_VERSIONS
-        """
-    } else {
-        """
-        samtools \\
-            bam2fq \\
-            $args \\
-            -@ $task.cpus \\
-            $inputbam | gzip --no-name > ${prefix}_interleaved.fq.gz
-
-        cat <<-END_VERSIONS > versions.yml
-        "${task.process}":
-            samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
-        END_VERSIONS
-        """
-    }
-}
--- a/modules/nf-core/samtools/bam2fq/meta.yml
+++ b/modules/nf-core/samtools/bam2fq/meta.yml
@@ -1,55 +0,0 @@
-name: samtools_bam2fq
-description: |
-  The module uses bam2fq method from samtools to
-  convert a SAM, BAM or CRAM file to FASTQ format
-keywords:
-  - bam2fq
-  - samtools
-  - fastq
-tools:
-  - samtools:
-      description: Tools for dealing with SAM, BAM and CRAM files
-      homepage: None
-      documentation: http://www.htslib.org/doc/1.1/samtools.html
-      tool_dev_url: None
-      doi: ""
-      licence: ["MIT"]
-
-input:
-  - meta:
-      type: map
-      description: |
-        Groovy Map containing sample information
-        e.g. [ id:'test', single_end:false ]
-  - inputbam:
-      type: file
-      description: BAM/CRAM/SAM file
-      pattern: "*.{bam,cram,sam}"
-  - split:
-      type: boolean
-      description: |
-        TRUE/FALSE value to indicate if reads should be separated into
-        /1, /2 and if present other, or singleton.
-        Note: choosing TRUE will generate 4 different files.
-        Choosing FALSE will produce a single file, which will be interleaved in case
-        the input contains paired reads.
-
-output:
-  - meta:
-      type: map
-      description: |
-        Groovy Map containing sample information
-        e.g. [ id:'test', single_end:false ]
-  - versions:
-      type: file
-      description: File containing software versions
-      pattern: "versions.yml"
-  - reads:
-      type: file
-      description: |
-        FASTQ files, which will be either a group of 4 files (read_1, read_2, other and singleton)
-        or a single interleaved .fq.gz file if the user chooses not to split the reads.
-      pattern: "*.fq.gz"
-
-authors:
-  - "@lescai"
--- a/modules/nf-core/samtools/fastq/main.nf
+++ b/modules/nf-core/samtools/fastq/main.nf
@@ -0,0 +1,44 @@
+process SAMTOOLS_FASTQ {
+    tag "$meta.id"
+    label 'process_low'
+
+    conda "bioconda::samtools=1.16.1"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://depot.galaxyproject.org/singularity/samtools:1.16.1--h6899075_1' :
+        'quay.io/biocontainers/samtools:1.16.1--h6899075_1' }"
+
+    input:
+    tuple val(meta), path(input)
+    val(interleave)
+
+    output:
+    tuple val(meta), path("*_{1,2}.fastq.gz")      , optional:true, emit: fastq
+    tuple val(meta), path("*_interleaved.fastq.gz"), optional:true, emit: interleaved
+    tuple val(meta), path("*_singleton.fastq.gz")  , optional:true, emit: singleton
+    tuple val(meta), path("*_other.fastq.gz")      , optional:true, emit: other
+    path  "versions.yml"                           , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    def output = ( interleave && ! meta.single_end ) ? "> ${prefix}_interleaved.fastq.gz" :
+        meta.single_end ? "-1 ${prefix}_1.fastq.gz -s ${prefix}_singleton.fastq.gz" :
+        "-1 ${prefix}_1.fastq.gz -2 ${prefix}_2.fastq.gz -s ${prefix}_singleton.fastq.gz"
+    """
+    samtools \\
+        fastq \\
+        $args \\
+        --threads ${task.cpus-1} \\
+        -0 ${prefix}_other.fastq.gz \\
+        $input \\
+        $output
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        samtools: \$(echo \$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\$//')
+    END_VERSIONS
+    """
+}
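Note on the `output` ternary in the new module: the redirect or the `-1`/`-2`/`-s` arguments are chosen from `interleave` and `meta.single_end`, while `-0 ${prefix}_other.fastq.gz` is always passed. The following is a minimal shell sketch of that selection, for illustration only. `fastq_output_args` is a hypothetical helper that is not part of the module, and the booleans are assumed to be passed as the strings `true`/`false`.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the Groovy ternary in SAMTOOLS_FASTQ.
# fastq_output_args is a hypothetical name, not part of nf-core/modules.
fastq_output_args() {
    local prefix="$1" interleave="$2" single_end="$3"
    if [ "$interleave" = "true" ] && [ "$single_end" != "true" ]; then
        # Paired-end reads, interleaved: samtools writes to stdout, which is redirected.
        echo "> ${prefix}_interleaved.fastq.gz"
    elif [ "$single_end" = "true" ]; then
        # Single-end reads: only a read-1 file plus singletons.
        echo "-1 ${prefix}_1.fastq.gz -s ${prefix}_singleton.fastq.gz"
    else
        # Paired-end reads, split into separate files.
        echo "-1 ${prefix}_1.fastq.gz -2 ${prefix}_2.fastq.gz -s ${prefix}_singleton.fastq.gz"
    fi
}

# Example: a single-end sample without interleaving.
fastq_output_args "sample1" false true
```

In `LONGREAD_HOSTREMOVAL` the module is called with `false` for `interleave`, so the interleaved branch is never taken there.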
--- a/modules/nf-core/samtools/fastq/meta.yml
+++ b/modules/nf-core/samtools/fastq/meta.yml
@@ -0,0 +1,62 @@
+name: samtools_fastq
+description: Converts a SAM/BAM/CRAM file to FASTQ
+keywords:
+  - bam
+  - sam
+  - cram
+  - fastq
+tools:
+  - samtools:
+      description: |
+        SAMtools is a set of utilities for interacting with and post-processing
+        short DNA sequence read alignments in the SAM, BAM and CRAM formats, written by Heng Li.
+        These files are generated as output by short read aligners like BWA.
+      homepage: http://www.htslib.org/
+      documentation: http://www.htslib.org/doc/samtools.html
+      doi: 10.1093/bioinformatics/btp352
+      licence: ["MIT"]
+
+input:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - input:
+      type: file
+      description: BAM/CRAM/SAM file
+      pattern: "*.{bam,cram,sam}"
+  - interleave:
+      type: boolean
+      description: Set true for interleaved fastq file
+
+output:
+  - meta:
+      type: map
+      description: |
+        Groovy Map containing sample information
+        e.g. [ id:'test', single_end:false ]
+  - versions:
+      type: file
+      description: File containing software versions
+      pattern: "versions.yml"
+  - fastq:
+      type: file
+      description: Compressed FASTQ file(s) with reads with either the READ1 or READ2 flag set in separate files.
+      pattern: "*_{1,2}.fastq.gz"
+  - interleaved:
+      type: file
+      description: Compressed FASTQ file with reads with either the READ1 or READ2 flag set in a combined file. Needs collated input file.
+      pattern: "*_interleaved.fastq.gz"
+  - singleton:
+      type: file
+      description: Compressed FASTQ file with singleton reads
+      pattern: "*_singleton.fastq.gz"
+  - other:
+      type: file
+      description: Compressed FASTQ file with reads with either both READ1 and READ2 flags set or unset
+      pattern: "*_other.fastq.gz"
+
+authors:
+  - "@priyanka-surana"
+  - "@suzannejin"
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -310,7 +310,7 @@
                    "type": "boolean",
                    "fa_icon": "fas fa-save",
                    "description": "Save reads from samples that went through the host-removal step",
-                    "help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `bam2fq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly."
+                    "help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `fastq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly."
                }
            },
            "fa_icon": "fas fa-user-times"
@@ -473,15 +473,18 @@
                },
                "motus_use_relative_abundance": {
                    "type": "boolean",
-                    "description": "Turn on printing relative abundance instead of counts."
+                    "description": "Turn on printing relative abundance instead of counts.",
+                    "fa_icon": "fas fa-percent"
                },
                "motus_save_mgc_read_counts": {
                    "type": "boolean",
-                    "description": "Turn on saving the mgc reads count."
+                    "description": "Turn on saving the mgc reads count.",
+                    "fa_icon": "fas fa-save"
                },
                "motus_remove_ncbi_ids": {
                    "type": "boolean",
-                    "description": "Turn on removing NCBI taxonomic IDs."
+                    "description": "Turn on removing NCBI taxonomic IDs.",
+                    "fa_icon": "fas fa-address-card"
                }
            },
            "fa_icon": "fas fa-align-center"
@@ -496,7 +499,7 @@
                    "type": "boolean",
                    "fa_icon": "fas fa-toggle-on",
                    "description": "Turn on standardisation of taxon tables across profilers",
-                    "help_text": "Turns on standardisation of output OTU tables across all tools; each into a TSV format following the following scheme:\n\n|TAXON   | SAMPLE_A | SAMPLE_B |\n|-------------|----------------|-----------------|\n| taxon_a | 32               | 123             |\n| taxon_b | 1                 | 5                 |\n\nThis currently only is generated for mOTUs."
+                    "help_text": "Turns on standardisation of output OTU tables across all tools.\n\nThis happens in two forms: firstly, if available, by a given classifier's/profiler's 'native' profile merger and standardisation (for Bracken, Kaiju, Kraken, Centrifuge, MetaPhlAn3, mOTUs), and secondly for _all_ classifiers/profilers in the pipeline using [`taxpasta`](https://taxpasta.readthedocs.io).\n\nIn the latter case, taxpasta generates a standardised output as follows:\n\n|TAXON   | SAMPLE_A | SAMPLE_B |\n|-------------|----------------|-----------------|\n| taxon_a | 32               | 123             |\n| taxon_b | 1                 | 5                 |\n\nwhereas the 'native' tools have varying output formats. See the pipeline [output](https://nf-co.re/taxprofiler) documentation for more information."
                },
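To make the standardised layout described in the `help_text` concrete, here is a small shell sketch that sums the counts per sample. It assumes a tab-separated table with the header and values from the example in the help text; the real taxpasta column names may differ, and `sample_totals` is a hypothetical helper, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Hypothetical standardised table, using the values from the help text example.
table="$(printf 'TAXON\tSAMPLE_A\tSAMPLE_B\ntaxon_a\t32\t123\ntaxon_b\t1\t5\n')"

# sample_totals (hypothetical helper): reads a tab-separated taxon table on
# stdin and prints one "sample<TAB>total" line per sample column.
sample_totals() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
        { for (i = 2; i <= NF; i++) total[i] += $i }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], total[i] }
    '
}

printf '%s\n' "$table" | sample_totals
```

Because every standardised table has the same shape, the same helper could in principle be pointed at the table of any classifier/profiler, which is not the case for the 'native' merged tables.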
                "standardisation_motus_generatebiom": {
                    "type": "boolean",
--- a/subworkflows/local/longread_hostremoval.nf
+++ b/subworkflows/local/longread_hostremoval.nf
@@ -5,7 +5,7 @@
 include { MINIMAP2_INDEX             } from '../../modules/nf-core/minimap2/index/main'
 include { MINIMAP2_ALIGN             } from '../../modules/nf-core/minimap2/align/main'
 include { SAMTOOLS_VIEW              } from '../../modules/nf-core/samtools/view/main'
-include { SAMTOOLS_BAM2FQ            } from '../../modules/nf-core/samtools/bam2fq/main'
+include { SAMTOOLS_FASTQ             } from '../../modules/nf-core/samtools/fastq/main'
 include { SAMTOOLS_INDEX             } from '../../modules/nf-core/samtools/index/main'
 include { SAMTOOLS_STATS             } from '../../modules/nf-core/samtools/stats/main'

@@ -38,8 +38,8 @@ workflow LONGREAD_HOSTREMOVAL {
    SAMTOOLS_VIEW ( ch_minimap2_mapped , [], [] )
    ch_versions      = ch_versions.mix( SAMTOOLS_VIEW.out.versions.first() )

-    SAMTOOLS_BAM2FQ ( SAMTOOLS_VIEW.out.bam, false )
-    ch_versions      = ch_versions.mix( SAMTOOLS_BAM2FQ.out.versions.first() )
+    SAMTOOLS_FASTQ ( SAMTOOLS_VIEW.out.bam, false )
+    ch_versions      = ch_versions.mix( SAMTOOLS_FASTQ.out.versions.first() )

    // Indexing whole BAM for host removal statistics
    SAMTOOLS_INDEX ( MINIMAP2_ALIGN.out.bam )
@@ -54,7 +54,7 @@ workflow LONGREAD_HOSTREMOVAL {

    emit:
    stats    = SAMTOOLS_STATS.out.stats     //channel: [val(meta), [reads  ] ]
-    reads    = SAMTOOLS_BAM2FQ.out.reads   // channel: [ val(meta), [ reads ] ]
+    reads    = SAMTOOLS_FASTQ.out.fastq.mix( SAMTOOLS_FASTQ.out.other)   // channel: [ val(meta), [ reads ] ]
    versions = ch_versions                 // channel: [ versions.yml ]
    mqc      = ch_multiqc_files
 }
--- a/workflows/taxprofiler.nf
+++ b/workflows/taxprofiler.nf
@@ -35,7 +35,8 @@ if (params.shortread_qc_includeunmerged && !params.shortread_qc_mergepairs) exit
 if (params.shortread_complexityfilter_tool == 'fastp' && ( params.perform_shortread_qc == false || params.shortread_qc_tool != 'fastp' ))  exit 1, "ERROR: [nf-core/taxprofiler] cannot use fastp complexity filtering if preprocessing not turned on and/or tool is not fastp. Please specify --perform_shortread_qc and/or --shortread_qc_tool 'fastp'"

 if (params.perform_shortread_hostremoval && !params.hostremoval_reference) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval requested but no --hostremoval_reference FASTA supplied. Check input." }
-if (!params.hostremoval_reference && params.hostremoval_reference_index) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }
+if (params.perform_shortread_hostremoval && !params.hostremoval_reference && params.shortread_hostremoval_index) { exit 1, "ERROR: [nf-core/taxprofiler] --shortread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }
+if (params.perform_longread_hostremoval && !params.hostremoval_reference && params.longread_hostremoval_index) { exit 1, "ERROR: [nf-core/taxprofiler] --longread_hostremoval_index provided but no --hostremoval_reference FASTA supplied. Check input." }

 if (params.hostremoval_reference           ) { ch_reference = file(params.hostremoval_reference) }
 if (params.shortread_hostremoval_index     ) { ch_shortread_reference_index = Channel.fromPath(params.shortread_hostremoval_index).map{[[], it]} } else { ch_shortread_reference_index = [] }
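The two index checks added in this hunk share one condition: host removal is enabled for that read type, no reference FASTA is given, and an index is given. Below is a shell sketch of that condition, assuming the parameters are passed as plain strings (an empty string meaning unset). `check_index_without_reference` is a hypothetical name used for illustration, not pipeline code.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the per-read-type index check in workflows/taxprofiler.nf.
# Arguments: perform flag ("true"/"false"), reference FASTA path, index path, flag name.
check_index_without_reference() {
    local perform="$1" reference="$2" index="$3" flag="$4"
    if [ "$perform" = "true" ] && [ -z "$reference" ] && [ -n "$index" ]; then
        echo "ERROR: [nf-core/taxprofiler] ${flag} provided but no --hostremoval_reference FASTA supplied. Check input." >&2
        return 1
    fi
    return 0
}

# Example: long-read host removal requested with an index but no reference.
check_index_without_reference true "" "host.mmi" "--longread_hostremoval_index" \
    || echo "validation failed"
```

Compared with the removed single check, gating on the per-read-type `perform_*` flag means an index supplied for a host-removal step that is switched off no longer triggers the error.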