mirror of https://github.com/MillironX/taxprofiler.git synced 2024-11-25 19:59:55 +00:00

Apply suggestions from code review

Co-authored-by: Sofia Stamouli <91951607+sofstam@users.noreply.github.com>
James A. Fellows Yates 2023-01-19 08:49:55 +01:00 committed by GitHub
parent ab0c62bdc1
commit 6d380fbbff
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23

@@ -6,16 +6,16 @@
 ## Introduction
-nf-core/taxprofiler is a pipeline for highly-parallelised taxonomic profiling of shotgun metagenomic data across multiple tools simultaneously. In addition to multiple profiling tools, at the same time it allows you to performing profiling across multiple databases and settings per tool, as well as produces standardised output tables to allow immediate cross comparison of results between tools.
+nf-core/taxprofiler is a pipeline for highly-parallelised taxonomic classification and profiling of shotgun metagenomic data across multiple tools simultaneously. In addition to multiple classification and profiling tools, at the same time it allows you to performing taxonomic classification and profiling across multiple databases and settings per tool, as well as produces standardised output tables to allow immediate cross comparison of results between tools.
 To run nf-core/taxprofiler, at a minimum two you require two inputs:
-- an sequenceing read samplesheet
+- a sequencing read samplesheet
 - a database samplesheet
-Both contain metadata and paths to the data of your input samples and database.
+Both contain metadata and paths to the data of your input samples and databases.
-When running nf-core/taxprofiler, every step and tool is 'opt in'. To run a given profiler you must make sure to supply both a database in your `<database>.csv` and supply `--run_<profiler>` flag to your command. Omitting either will result in the profiling tool not executing.
+When running nf-core/taxprofiler, every step and tool is 'opt in'. To run a given classifier or profiler you must make sure to supply both a database in your `<database>.csv` and supply `--run_<profiler>` flag to your command. Omitting either will result in the profiling tool not executing.
 nf-core/profiler also includes optional pre-processing (adapter clipping, merge running etc.) or post-processing (visualisation) steps. These are also opt in with a `--perform_<step>` flag. In some cases, the pre- and post-processing steps may also require additional files. Please check the parameters tab of this documentation for more information.
@@ -160,7 +160,7 @@ nf-core/taxprofiler offers four main preprocessing steps for preprocessing raw s
 Raw sequencing read processing in the form of adapter clipping and paired-end read merging can be activated via the `--perform_shortread_qc` or `--perform_longread_qc` flags.
-It is highly recommended to run this on raw reads to remove artifacts from sequencing that can cause false positive identification of taxa (e.g. contaminated reference genomes) and/or skews in taxonomic abundance profiles. If you have public data, normally these should have been corrected for, however you should still check these steps have indeed been already performed.
+It is highly recommended to run this on raw reads to remove artifacts from sequencing that can cause false positive identification of taxa (e.g. contaminated reference genomes) and/or skews in taxonomic abundance profiles. If you have public data, normally these should have been corrected for, however you should still check that these steps have indeed been already performed.
 There are currently two options for short-read preprocessing: [`fastp`](https://github.com/OpenGene/fastp) or [`adapterremoval`](https://github.com/MikkelSchubert/adapterremoval).
@@ -213,7 +213,7 @@ You can optionally save the FASTQ output of the run merging with the `--save_run
 #### Profiling
-The following sections provides tips and suggestions for running the different metagenomic taxonomic profiling tools _within the pipeline_. For advice and/or guidance whether you should run a particular tool on your specific data, please see the documentation of each tool!
+The following sections provide tips and suggestions for running the different taxonomic classification and profiling tools _within the pipeline_. For advice and/or guidance whether you should run a particular tool on your specific data, please see the documentation of each tool!
 Not all tools currently have dedicated tips, suggestions and/or recommendations, however we welcome further contributions for existing and additional tools via pull requests to the [nf-core/taxprofiler repository](https://github.com/nf-core/taxprofiler)!
@@ -280,7 +280,7 @@ nf-core/taxprofiler supports generation of Krona interactive piechart plots for
 In addition to per-sample profiles, the pipeline also supports generation of 'native' multi-sample taxonomic profiles (i.e., those generated by the taxonomic profiling tools themselves or additional utility scripts provided by the tool authors).
-This are executed on a per-database level. I.e., you will get a multi-sample taxon table for each database you provide for each tool and will be placed in the same directory as the directories containing the per-sample profiles.
+These are executed on a per-database level. I.e., you will get a multi-sample taxon table for each database you provide for each tool and will be placed in the same directory as the directories containing the per-sample profiles.
 The following tools will produce multi-sample taxon tables:
@@ -503,7 +503,7 @@ The following tutorials assumes you already have the tool available (e.g. instal
 #### Bracken custom database
-Bracken does not require an indepndent database construction, but rather builds upon Kraken2 databases. See [Kraken2](#kraken2-custom-database) for more information on how to build these.
+Bracken does not require an independent database construction, but rather builds upon Kraken2 databases. See [Kraken2](#kraken2-custom-database) for more information on how to build these.
 In addition to a Kraken2 database, you also need to have the (average) read lengths (in bp) of your sequencing experiment, the K-mer size used to build the Kraken2 database, and Kraken2 available on your machine.
@@ -536,7 +536,7 @@ You can follow Bracken [tutorial](https://ccb.jhu.edu/software/bracken/index.sht
 To build a custom Centrifuge database, a user needs to download taxonomy files, make a custom `seqid2taxid.map` and combine the fasta files together.
-In total You need four components: a tab-separated file mapping sequence IDs to taxonomy IDs (`--conversion-table`), a tab-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree (`--taxonomy-tree`), a pipe-separated file mapping taxonomy IDs to a name (`--name-table`), and the reference sequences.
+In total, you need four components: a tab-separated file mapping sequence IDs to taxonomy IDs (`--conversion-table`), a tab-separated file mapping taxonomy IDs to their parents and rank, up to the root of the tree (`--taxonomy-tree`), a pipe-separated file mapping taxonomy IDs to a name (`--name-table`), and the reference sequences.
 An example of custom `seqid2taxid.map`:
@@ -596,7 +596,7 @@ A detailed description can be found [here](https://github.com/bbuchfink/diamond/
 #### Kaiju custom database
-To build a kaiju database, you need three components: a FASTA file with the protein sequences (the headers are the numeric NCBI taxon identifiers of the protein sequences), and you need to define the uppercase characters of the standard 20 amino acids you wish to include.
+To build a kaiju database, you need two components: a FASTA file with the protein sequences (the headers are the numeric NCBI taxon identifiers of the protein sequences), and you need to define the uppercase characters of the standard 20 amino acids you wish to include.
 ```bash
 kaiju-mkbwt -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa