Update docs based on feedback and add missing results directories

2024-11-21 22:16:05 +00:00 · 2023-03-12 08:35:45 +01:00 · 2023-03-12 08:35:45 +01:00 · e49668005f
commit e49668005f
parent efa398edab
4 changed files with 61 additions and 26 deletions
--- a/docs/output.md
+++ b/docs/output.md
@@ -35,14 +35,18 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

+![](images/taxprofiler_tube.png)
+
 ### FastQC or Falco

 <details markdown="1">
 <summary>Output files</summary>

- `fastqc/`
-  - `*_fastqc.html`: FastQC or Falco report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images (FastQC only).
+- `{fastqc,falco}/`
+  - `{raw,preprocessed}/`
+    - `*.html`: FastQC or Falco report containing quality metrics in HTML format.
+    - `*.txt`: FastQC or Falco report containing quality metrics in TXT format.
+    - `*.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images (FastQC only).

 </details>

@@ -186,9 +190,12 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/
 <summary>Output files</summary>

 - `bowtie2/`
-  - `<sample_id>.bam`: BAM file containing reads that aligned against the user-supplied reference genome as well as unmapped reads
-  - `<sample_id>.bowtie2.log`: log file about the mapped reads
-  - `<sample_id>.unmapped.fastq.gz`: the off-target reads from the mapping that is used in downstream steps.
+  - `build/`
+    - `*.bt2`: Bowtie2 indices of the reference genome, only if `--save_hostremoval_index` is supplied.
+  - `align/`
+    - `<sample_id>.bam`: BAM file containing reads that aligned against the user-supplied reference genome as well as unmapped reads
+    - `<sample_id>.bowtie2.log`: log file about the mapped reads
+    - `<sample_id>.unmapped.fastq.gz`: the off-target reads from the mapping that are used in downstream steps.

 </details>

@@ -210,7 +217,10 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) or o
 <summary>Output files</summary>

 - `minimap2`
-  - `<sample_id>.bam`: Alignment file in BAM format containing both mapped and unmapped reads.
+  - `build/`
+    - `*.mmi2`: minimap2 indices of the reference genome, only if `--save_hostremoval_index` is supplied.
+  - `align/`
+    - `<sample_id>.bam`: Alignment file in BAM format containing both mapped and unmapped reads.

 </details>

@@ -243,13 +253,31 @@ This directory will be present and contain the unmapped reads from the `.fastq`
 <details markdown="1">
 <summary>Output files</summary>

- `samtoolsstats`
+- `samtools/stats`
  - `<sample_id>.stats`: File containing samtools stats output.

 </details>

 In most cases you do not need to check this file, as it is rendered in the MultiQC run report.

+### Run Merging
+
+nf-core/taxprofiler offers the option to merge FASTQ files of multiple sequencing runs or libraries that derive from the same sample, as specified in the input samplesheet.
+
+This is the last preprocessing step, so if you have multiple runs or libraries (and run merging turned on), this will represent the final reads that will go into classification/profiling steps.
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `run_merging/`
+  - `*.fastq.gz`: Concatenated FASTQ files on a per-sample basis
+
+</details>
+
+Note that you will only find samples that went through the run merging step in this directory. Samples that had a single run or library will not go through this step of the pipeline and thus will not be present in this directory.
+
+⚠️ If your pipeline run includes single-run or single-library samples and you wish to save the final reads that go into classification/profiling, you must also turn on saving of the reads from the preceding preprocessing step that you have turned on!
+
 ### Bracken

 [Bracken](https://ccb.jhu.edu/software/bracken/) (Bayesian Reestimation of Abundance with Kraken) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Bracken uses the taxonomy labels assigned by Kraken, a highly accurate metagenomics classification algorithm, to estimate the number of reads originating from each species present in a sample.
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -89,7 +89,9 @@ The pipeline takes the paths and specific classification/profiling parameters of

 > ⚠️ To allow user freedom, nf-core/taxprofiler does not check for the presence of mandatory non-file database parameters, or for their validity, for correct execution of the tool - excluding options offered via pipeline level parameters! Please validate your database parameters (cross-referencing [parameters](https://nf-co.re/taxprofiler/parameters) and the given tool documentation) before submitting the database sheet! For example, if you don't use the default read length - Bracken will require `-r <read_length>` in the `db_params` column.

-An example database sheet can look as follows, where 5 tools are being used, and `malt` and `kraken2` will be used against two databases each. This is because specifying `bracken` implies first running `kraken2` on the same database.
+An example database sheet can look as follows, where 7 tools are being used, and `malt` and `kraken2` will be used against two databases each.
+
+`kraken2` will be run twice even though it has only a single 'dedicated' database, because specifying `bracken` implies first running `kraken2` on the `bracken` database, as required by `bracken`.

 ```console
 tool,db_name,db_params,db_path
--- a/nextflow_schema.json
+++ b/nextflow_schema.json
@@ -67,7 +67,7 @@
                "save_preprocessed_reads": {
                    "type": "boolean",
                    "fa_icon": "fas fa-save",
-                    "description": "Save reads from adapter clipping/pair-merging, length filtering for both short and long reads",
+                    "description": "Save reads from samples that went through the adapter clipping, pair-merging, and length filtering steps for both short and long reads",
                    "help_text": "This saves the FASTQ output from the following tools:\n\n- fastp\n- AdapterRemoval\n- Porechop\n- Filtlong\n\nThese reads will be a mixture of: adapter clipped, quality trimmed, pair-merged, and length filtered, depending on the parameters you set."
                }
            },
@@ -116,7 +116,8 @@
                    "type": "string",
                    "default": "None",
                    "description": "Specify a list of all possible adapters to trim. Overrides --shortread_qc_adapter1/2. Formats: .txt (AdapterRemoval) or .fasta. (fastp).",
-                    "help_text": "Allows to supply a file with a list of adapter (combinations) to remove from all files. \n\nOverrides the --shortread_qc_adapter1/--shortread_qc_adapter2 parameters . \n\nFor AdapterRemoval this consists of a two column table with a `.txt` extension: first column represents forward strand, second column for reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. See AdapterRemoval documentation for more information.\n\nFor fastp this consists of a standard FASTA format with a `.fasta`/`.fa`/`.fna`/`.fas` extension. The adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. fastp trims the adapters present in the FASTA file one by one.\n\n> Modifies AdapterRemoval parameter: --adapter-list\n> Modifies fastp parameter: --adapter_fasta"
+                    "help_text": "Allows to supply a file with a list of adapter (combinations) to remove from all files. \n\nOverrides the --shortread_qc_adapter1/--shortread_qc_adapter2 parameters . \n\nFor AdapterRemoval this consists of a two column table with a `.txt` extension: first column represents forward strand, second column for reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. See AdapterRemoval documentation for more information.\n\nFor fastp this consists of a standard FASTA format with a `.fasta`/`.fa`/`.fna`/`.fas` extension. The adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. fastp trims the adapters present in the FASTA file one by one.\n\n> Modifies AdapterRemoval parameter: --adapter-list\n> Modifies fastp parameter: --adapter_fasta",
+                    "fa_icon": "fas fa-th-list"
                },
                "shortread_qc_mergepairs": {
                    "type": "boolean",
@@ -194,7 +195,7 @@
                "save_complexityfiltered_reads": {
                    "type": "boolean",
                    "fa_icon": "fas fa-save",
-                    "description": "Save complexity filtered short-reads",
+                    "description": "Save reads from samples that went through the complexity filtering step",
                    "help_text": "Specify whether to save the final complexity filtered reads in your results directory (`--outdir`)."
                }
            },
@@ -302,7 +303,7 @@
                "save_hostremoval_unmapped": {
                    "type": "boolean",
                    "fa_icon": "fas fa-save",
-                    "description": "Save unmapped reads in FASTQ format from host removal",
+                    "description": "Save reads from samples that went through the host-removal step",
                    "help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `bam2fq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly."
                }
            },
@@ -323,8 +324,8 @@
                "save_runmerged_reads": {
                    "type": "boolean",
                    "fa_icon": "fas fa-save",
-                    "description": "Save run-concatenated input FASTQ files for each sample",
-                    "help_text": "Save the run- and library-concatenated reads of a given sample in FASTQ format."
+                    "description": "Save reads from samples that went through the run-merging step",
+                    "help_text": "Save the run- and library-concatenated reads of a given sample in FASTQ format.\n\n> \u26a0\ufe0f Only samples that went through the run-merging step of the pipeline will be stored in the resulting directory. \n\nIf you wish to save the files that go to the classification/profiling steps for samples that _did not_ go through run merging, you must supply the appropriate upstream `--save_<preprocessing_step>` flag.\n\n"
                }
            },
            "fa_icon": "fas fa-clipboard-check"
@@ -427,7 +428,7 @@
                },
                "run_bracken": {
                    "type": "boolean",
-                    "description": "Post-process kraken2 reports with Bracken.",
+                    "description": "Turn on Bracken (and the required Kraken2 prerequisite step).",
                    "fa_icon": "fas fa-toggle-on"
                },
                "run_malt": {
@@ -513,34 +514,39 @@
                "standardisation_taxpasta_format": {
                    "type": "string",
                    "default": "tsv",
-                    "fa_icon": "fas fa-file",
+                    "fa_icon": "fas fa-pastafarianism",
                    "description": "The desired output format.",
                    "enum": ["tsv", "csv", "arrow", "parquet", "biom"]
                },
                "taxpasta_taxonomy_dir": {
                    "type": "string",
                    "description": "The path to a directory containing taxdump files.",
-                    "help_text": "This arguments provides the path to the directory containing taxdump files. At least nodes.dmp and names.dmp are required. A merged.dmp file is optional. \n\nModifies tool parameter(s):\n-taxpasta: `--taxpasta_taxonomy_dir`"
+                    "help_text": "This arguments provides the path to the directory containing taxdump files. At least nodes.dmp and names.dmp are required. A merged.dmp file is optional. \n\nModifies tool parameter(s):\n-taxpasta: `--taxpasta_taxonomy_dir`",
+                    "fa_icon": "fas fa-tree"
                },
                "taxpasta_add_name": {
                    "type": "boolean",
                    "description": "Add the taxon name to the output.",
-                    "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon name can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_name`"
+                    "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon name can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_name`",
+                    "fa_icon": "fas fa-tag"
                },
                "taxpasta_add_rank": {
                    "type": "boolean",
                    "description": "Add the taxon rank to the output.",
-                    "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon rank can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_rank`"
+                    "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon rank can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_rank`",
+                    "fa_icon": "fas fa-sort-amount-down-alt"
                },
                "taxpasta_add_lineage": {
                    "type": "boolean",
-                    "description": "Add the taxon's entire lineage to the output.",
-                    "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon names separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_lineage`\n"
+                    "description": "Add the taxon's entire name lineage to the output.",
+                    "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon names separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_lineage`\n",
+                    "fa_icon": "fas fa-link"
                },
                "taxpasta_add_idlineage": {
                    "type": "boolean",
-                    "description": "Add the taxon's entire lineage to the output.",
-                    "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon identifiers separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_idlineage`\n"
+                    "description": "Add the taxon's entire ID lineage to the output.",
+                    "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon identifiers separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_idlineage`\n",
+                    "fa_icon": "fas fa-link"
                }
            },
            "fa_icon": "fas fa-chart-line"
--- a/subworkflows/local/profiling.nf
+++ b/subworkflows/local/profiling.nf
@@ -117,12 +117,11 @@ workflow PROFILING {

    }

-    if ( params.run_kraken2 ) {
+    if ( params.run_kraken2 || params.run_bracken ) {
        // Have to pick first element of db_params if using bracken,
        // as db sheet for bracken must have ; sep list to
        // distinguish between kraken and bracken parameters
        ch_input_for_kraken2 = ch_input_for_profiling.kraken2
-                                .dump(tag: "ch_input_for_kraken2_b4")
                                .map {
                                    meta, reads, db_meta, db ->
                                        def db_meta_new = db_meta.clone()