From e49668005f997c279490c68cc1b0808d080e01f9 Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Sun, 12 Mar 2023 08:35:45 +0100 Subject: [PATCH 1/5] Update docs based on feedback and add missing results directories --- docs/output.md | 44 +++++++++++++++++++++++++++------ docs/usage.md | 4 ++- nextflow_schema.json | 36 ++++++++++++++++----------- subworkflows/local/profiling.nf | 3 +-- 4 files changed, 61 insertions(+), 26 deletions(-) diff --git a/docs/output.md b/docs/output.md index 357194c..bddd755 100644 --- a/docs/output.md +++ b/docs/output.md @@ -35,14 +35,18 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d - [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution +![](images/taxprofiler_tube.png) + ### FastQC or Falco
 Output files

-- `fastqc/`
-  - `*_fastqc.html`: FastQC or Falco report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images (FastQC only).
+- `{fastqc,falco}/`
+  - {raw,preprocessed}
+    - `*html`: FastQC or Falco report containing quality metrics in HTML format.
+    - `*.txt`: FastQC or Falco report containing quality metrics in TXT format.
+    - `*.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images (FastQC only).
@@ -186,9 +190,12 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) and/ Output files - `bowtie2/` - - `.bam`: BAM file containing reads that aligned against the user-supplied reference genome as well as unmapped reads - - `.bowtie2.log`: log file about the mapped reads - - `.unmapped.fastq.gz`: the off-target reads from the mapping that is used in downstream steps. + - `build/` + - `*.bt2`: Bowtie2 indicies of reference genome, only if `--save_hostremoval_index` supplied. + - `align/` + - `.bam`: BAM file containing reads that aligned against the user-supplied reference genome as well as unmapped reads + - `.bowtie2.log`: log file about the mapped reads + - `.unmapped.fastq.gz`: the off-target reads from the mapping that is used in downstream steps. @@ -210,7 +217,10 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) or o Output files - `minimap2` - - `.bam`: Alignment file in BAM format containing both mapped and unmapped reads. + - `build/` + - `*.mmi2`: minimap2 indicies of reference genome, only if `--save_hostremoval_index` supplied. + - `align/` + - `.bam`: Alignment file in BAM format containing both mapped and unmapped reads. @@ -243,13 +253,31 @@ This directory will be present and contain the unmapped reads from the `.fastq`
 Output files

-- `samtoolsstats`
+- `samtools/stats`
   - `.stats`: File containing samtools stats output.
In most cases you do not need to check this file, as it is rendered in the MultiQC run report. +### Run Merging + +nf-core/taxprofiler offers the option to merge FASTQ files of multiple sequencing runs or libraries that derive from the same sample, as specified in the input samplesheet. + +This is the last preprocessing step, so if you have multiple runs or libraries (and run merging turned on), this will represent the final reads that will go into classification/profiling steps. + +
+Output files
+
+- `run_merging/`
+  - `*.fastq.gz`: Concatenated FASTQ files on a per-sample basis
+
+
+ +Note that you will only find samples that went through the run merging step in this directory. For samples that had a single run or library will not go through this step of the pipeline and thus will not be present in this directory. + +⚠️ You must make sure to turn on the saving of the reads from the previous preprocessing step you may have turned on, if you have single-run or library reads in your pipeline run, and wish to save the final reads that go into classification/profiling! + ### Bracken [Bracken](https://ccb.jhu.edu/software/bracken/) (Bayesian Reestimation of Abundance with Kraken) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample. Braken uses the taxonomy labels assigned by Kraken, a highly accurate metagenomics classification algorithm, to estimate the number of reads originating from each species present in a sample. diff --git a/docs/usage.md b/docs/usage.md index 10c5ce6..a38931e 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -89,7 +89,9 @@ The pipeline takes the paths and specific classification/profiling parameters of > ⚠️ To allow user freedom, nf-core/taxprofiler does not check for mandatory or the validity of non-file database parameters for correct execution of the tool - excluding options offered via pipeline level parameters! Please validate your database parameters (cross-referencing [parameters](https://nf-co.re/taxprofiler/parameters, and the given tool documentation) before submitting the database sheet! For example, if you don't use the default read length - Bracken will require `-r ` in the `db_params` column. -An example database sheet can look as follows, where 5 tools are being used, and `malt` and `kraken2` will be used against two databases each. This is because specifying `bracken` implies first running `kraken2` on the same database. 
+An example database sheet can look as follows, where 7 tools are being used, and `malt` and `kraken2` will be used against two databases each. + +`kraken2` will be run twice even though only having a single 'dedicated' database because specifying `bracken` implies first running `kraken2` on the `bracken` database, as required by `bracken`. ```console tool,db_name,db_params,db_path diff --git a/nextflow_schema.json b/nextflow_schema.json index 2a72303..417549e 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -67,7 +67,7 @@ "save_preprocessed_reads": { "type": "boolean", "fa_icon": "fas fa-save", - "description": "Save reads from adapter clipping/pair-merging, length filtering for both short and long reads", + "description": "Save reads from samples that went through the adapter clipping, pair-merging, and length filtering steps for both short and long reads", "help_text": "This saves the FASTQ output from the following tools:\n\n- fastp\n- AdapterRemoval\n- Porechop\n- Filtlong\n\nThese reads will be a mixture of: adapter clipped, quality trimmed, pair-merged, and length filtered, depending on the parameters you set." } }, @@ -116,7 +116,8 @@ "type": "string", "default": "None", "description": "Specify a list of all possible adapters to trim. Overrides --shortread_qc_adapter1/2. Formats: .txt (AdapterRemoval) or .fasta. (fastp).", - "help_text": "Allows to supply a file with a list of adapter (combinations) to remove from all files. \n\nOverrides the --shortread_qc_adapter1/--shortread_qc_adapter2 parameters . \n\nFor AdapterRemoval this consists of a two column table with a `.txt` extension: first column represents forward strand, second column for reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. See AdapterRemoval documentation for more information.\n\nFor fastp this consists of a standard FASTA format with a `.fasta`/`.fa`/`.fna`/`.fas` extension. 
The adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. fastp trims the adapters present in the FASTA file one by one.\n\n> Modifies AdapterRemoval parameter: --adapter-list\n> Modifies fastp parameter: --adapter_fasta" + "help_text": "Allows to supply a file with a list of adapter (combinations) to remove from all files. \n\nOverrides the --shortread_qc_adapter1/--shortread_qc_adapter2 parameters . \n\nFor AdapterRemoval this consists of a two column table with a `.txt` extension: first column represents forward strand, second column for reverse strand. You must supply all possible combinations, one per line, and this list is applied to all files. See AdapterRemoval documentation for more information.\n\nFor fastp this consists of a standard FASTA format with a `.fasta`/`.fa`/`.fna`/`.fas` extension. The adapter sequence in this file should be at least 6bp long, otherwise it will be skipped. fastp trims the adapters present in the FASTA file one by one.\n\n> Modifies AdapterRemoval parameter: --adapter-list\n> Modifies fastp parameter: --adapter_fasta", + "fa_icon": "fas fa-th-list" }, "shortread_qc_mergepairs": { "type": "boolean", @@ -194,7 +195,7 @@ "save_complexityfiltered_reads": { "type": "boolean", "fa_icon": "fas fa-save", - "description": "Save complexity filtered short-reads", + "description": "Save reads from samples that went through the complexity filtering step", "help_text": "Specify whether to save the final complexity filtered reads in your results directory (`--outdir`)." 
} }, @@ -302,7 +303,7 @@ "save_hostremoval_unmapped": { "type": "boolean", "fa_icon": "fas fa-save", - "description": "Save unmapped reads in FASTQ format from host removal", + "description": "Save reads from samples that went through the host-removal step", "help_text": "Save only the reads NOT mapped to the reference genome in FASTQ format (as exported from `samtools view` and `bam2fq`).\n\nThis can be useful if you wish to perform other analyses on the off-target reads from the host mapping, such as manual profiling or _de novo_ assembly." } }, @@ -323,8 +324,8 @@ "save_runmerged_reads": { "type": "boolean", "fa_icon": "fas fa-save", - "description": "Save run-concatenated input FASTQ files for each sample", - "help_text": "Save the run- and library-concatenated reads of a given sample in FASTQ format." + "description": "Save reads from samples that went through the run-merging step", + "help_text": "Save the run- and library-concatenated reads of a given sample in FASTQ format.\n\n> \u26a0\ufe0f Only samples that went through the run-merging step of the pipeline will be stored in the resulting directory. 
\n\nIf you wish to save the files that go to the classification/profiling steps for samples that _did not_ go through run merging, you must supply the appropriate upstream `--save_` flag.\n\n" } }, "fa_icon": "fas fa-clipboard-check" @@ -427,7 +428,7 @@ }, "run_bracken": { "type": "boolean", - "description": "Post-process kraken2 reports with Bracken.", + "description": "Turn on Bracken (and the required Kraken2 prerequisite step).", "fa_icon": "fas fa-toggle-on" }, "run_malt": { @@ -513,34 +514,39 @@ "standardisation_taxpasta_format": { "type": "string", "default": "tsv", - "fa_icon": "fas fa-file", + "fa_icon": "fas fa-pastafarianism", "description": "The desired output format.", "enum": ["tsv", "csv", "arrow", "parquet", "biom"] }, "taxpasta_taxonomy_dir": { "type": "string", "description": "The path to a directory containing taxdump files.", - "help_text": "This arguments provides the path to the directory containing taxdump files. At least nodes.dmp and names.dmp are required. A merged.dmp file is optional. \n\nModifies tool parameter(s):\n-taxpasta: `--taxpasta_taxonomy_dir`" + "help_text": "This arguments provides the path to the directory containing taxdump files. At least nodes.dmp and names.dmp are required. A merged.dmp file is optional. \n\nModifies tool parameter(s):\n-taxpasta: `--taxpasta_taxonomy_dir`", + "fa_icon": "fas fa-tree" }, "taxpasta_add_name": { "type": "boolean", "description": "Add the taxon name to the output.", - "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon name can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_name`" + "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. 
The taxon name can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_name`", + "fa_icon": "fas fa-tag" }, "taxpasta_add_rank": { "type": "boolean", "description": "Add the taxon rank to the output.", - "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon rank can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_rank`" + "help_text": "The standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon rank can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_rank`", + "fa_icon": "fas fa-sort-amount-down-alt" }, "taxpasta_add_lineage": { "type": "boolean", - "description": "Add the taxon's entire lineage to the output.", - "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon names separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_lineage`\n" + "description": "Add the taxon's entire name lineage to the output.", + "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. 
The taxon's entire lineage with the taxon names separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_lineage`\n", + "fa_icon": "fas fa-link" }, "taxpasta_add_idlineage": { "type": "boolean", - "description": "Add the taxon's entire lineage to the output.", - "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon identifiers separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_idlineage`\n" + "description": "Add the taxon's entire ID lineage to the output.", + "help_text": "\nThe standard output format of taxpasta is a two-column table including the read counts and the integer taxonomic ID. The taxon's entire lineage with the taxon identifiers separated by semi-colons can be added as additional information to the output table.\n\nModifies tool parameter(s):\n- taxpasta: `--taxpasta_add_idlineage`\n", + "fa_icon": "fas fa-link" } }, "fa_icon": "fas fa-chart-line" diff --git a/subworkflows/local/profiling.nf b/subworkflows/local/profiling.nf index e9440b3..d328a9c 100644 --- a/subworkflows/local/profiling.nf +++ b/subworkflows/local/profiling.nf @@ -117,12 +117,11 @@ workflow PROFILING { } - if ( params.run_kraken2 ) { + if ( params.run_kraken2 || params.run_bracken ) { // Have to pick first element of db_params if using bracken, // as db sheet for bracken must have ; sep list to // distinguish between kraken and bracken parameters ch_input_for_kraken2 = ch_input_for_profiling.kraken2 - .dump(tag: "ch_input_for_kraken2_b4") .map { meta, reads, db_meta, db -> def db_meta_new = db_meta.clone() From 95f9bf1fa9bb3ca021e4d9ef1dee6e8174f3451c Mon Sep 17 00:00:00 2001 From: James Fellows Yates Date: Sun, 12 Mar 2023 08:43:56 +0100 Subject: [PATCH 2/5] Update credits section 
--- README.md | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 3d18227..676a3da 100644 --- a/README.md +++ b/README.md @@ -77,11 +77,16 @@ The nf-core/taxprofiler pipeline comes with documentation about the pipeline [us ## Credits -nf-core/taxprofiler was originally written by nf-core community. +nf-core/taxprofiler was originally written by [James A. Fellows Yates](https://github.com/jfy133), [Moritz Beber](https://github.com/Midnighter), and [Sofia Stamouli](https://github.com/sofsam). -We thank the following people for their extensive assistance in the development of this pipeline: +We thank the following people for their contributions to the development of this pipeline: -[James A. Fellows Yates](https://github.com/jfy133), [Moritz Beber](https://github.com/Midnighter), [Lauri Mesilaakso](https://github.com/ljmesi), [Sofia Stamouli](https://github.com/sofsam), [Maxime Borry](https://github.com/maxibor),[Thomas A. Christensen II](https://github.com/MillironX), [Jianhong Ou](https://github.com/jianhong), [Rafal Stepien](https://github.com/rafalstepien), [Mahwash Jamy](https://github.com/mjamy). +[Lauri Mesilaakso](https://github.com/ljmesi), [Tanja Normark](https://github.com/talnor), [Maxime Borry](https://github.com/maxibor),[Thomas A. Christensen II](https://github.com/MillironX), [Jianhong Ou](https://github.com/jianhong), [Rafal Stepien](https://github.com/rafalstepien), [Mahwash Jamy](https://github.com/mjamy), and the [nf-core/community](https://nf-co.re/community). + +We also are grateful for the feedback and comments from: + +- [Alex Hübner](https://github.com/alexhbnr) +- Lili Andersson-Li ## Contributions and Support From 6266ee782dc8240933554238aa2e38ab90e5d7e1 Mon Sep 17 00:00:00 2001 From: "James A. 
Fellows Yates" Date: Sun, 12 Mar 2023 11:08:31 +0100 Subject: [PATCH 3/5] Update README.md Co-authored-by: Sofia Stamouli <91951607+sofstam@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 676a3da..a6dc165 100644 --- a/README.md +++ b/README.md @@ -86,7 +86,7 @@ We thank the following people for their contributions to the development of this We also are grateful for the feedback and comments from: - [Alex Hübner](https://github.com/alexhbnr) -- Lili Andersson-Li +- [LilyAnderssonLee](https://github.com/LilyAnderssonLee) ## Contributions and Support From 7f6e162e1cac133104203636f2005f99f8e18b87 Mon Sep 17 00:00:00 2001 From: "James A. Fellows Yates" Date: Sun, 12 Mar 2023 11:08:37 +0100 Subject: [PATCH 4/5] Update docs/output.md Co-authored-by: Sofia Stamouli <91951607+sofstam@users.noreply.github.com> --- docs/output.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/output.md b/docs/output.md index bddd755..4ef6770 100644 --- a/docs/output.md +++ b/docs/output.md @@ -218,7 +218,7 @@ It is used with nf-core/taxprofiler to allow removal of 'host' (e.g. human) or o - `minimap2` - `build/` - - `*.mmi2`: minimap2 indicies of reference genome, only if `--save_hostremoval_index` supplied. + - `*.mmi2`: minimap2 indices of reference genome, only if `--save_hostremoval_index` supplied. - `align/` - `.bam`: Alignment file in BAM format containing both mapped and unmapped reads. From 9550c76c2554b6b2a24c681631ee769deb07fe21 Mon Sep 17 00:00:00 2001 From: "James A. 
Fellows Yates" Date: Sun, 12 Mar 2023 11:11:05 +0100 Subject: [PATCH 5/5] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index a6dc165..ba08835 100644 --- a/README.md +++ b/README.md @@ -88,6 +88,8 @@ We also are grateful for the feedback and comments from: - [Alex Hübner](https://github.com/alexhbnr) - [LilyAnderssonLee](https://github.com/LilyAnderssonLee) +Credit and thanks also goes to [Zandra Fagernäs](https://github.com/ZandraFagernas) for the logo. + ## Contributions and Support If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
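
Note on the `subworkflows/local/profiling.nf` hunk in PATCH 1/5: the guard changes from `params.run_kraken2` to `params.run_kraken2 || params.run_bracken`, and the surrounding comment says the first element of a `;`-separated `db_params` value is used for Kraken2 when Bracken is in play. The following is a minimal Python sketch of that logic, not the pipeline's Nextflow code. The function names and the example `db_params` value are hypothetical and chosen only for illustration.

```python
# Illustrative re-expression of the guard and parameter split described in
# the profiling.nf hunk. All names here are hypothetical, not pipeline code.


def kraken2_block_enabled(run_kraken2: bool, run_bracken: bool) -> bool:
    """Mirror of the patched condition: Kraken2 runs if either flag is set."""
    return run_kraken2 or run_bracken


def kraken2_db_params(tool: str, db_params: str) -> str:
    """Return the parameters intended for Kraken2.

    Assumption taken from the code comment in the hunk: for a 'bracken'
    database row, db_params is a ';'-separated list whose first element
    belongs to Kraken2. Other rows are passed through unchanged.
    """
    if tool == "bracken":
        return db_params.split(";")[0]
    return db_params


if __name__ == "__main__":
    # Before the patch, only --run_bracken would have skipped the Kraken2 block.
    print(kraken2_block_enabled(run_kraken2=False, run_bracken=True))
    # Hypothetical bracken row: Kraken2 part first, Bracken part second.
    print(kraken2_db_params("bracken", "--quick;-r 150"))
```

With this guard, supplying only `--run_bracken` is enough to reach the Kraken2 step, which matches the updated `run_bracken` description in `nextflow_schema.json` ("Turn on Bracken (and the required Kraken2 prerequisite step)").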