Expected (uncompressed) database files for each tool are as follows:
## Running the pipeline
@ -489,8 +445,121 @@ NXF_OPTS='-Xms1g -Xmx4g'
Here we will give brief guidance on how to build databases for each supported taxonomic profiler. You should always consult the documentation of each tool for more information; however, we provide these as quick reference guides.
The following tutorial assumes you already have the tool available (e.g. installed locally, or via conda, docker etc.), and you have already downloaded the FASTA files you wish to build into a database.
#### Bracken
<details markdown="1">
<summary>Output files</summary>

- `bracken`
  - `hash.k2d`
  - `opts.k2d`
  - `taxo.k2d`
  - `database.kraken`
  - `database100mers.kmer_distrib`
  - `database100mers.kraken`
  - `database150mers.kmer_distrib`
  - `database150mers.kraken`

</details>
Bracken does not provide any default databases for profiling, but rather builds upon Kraken2 databases. See [Kraken2](#kraken2) for more information on how to build these.
In addition to a Kraken2 database, you also need to have the (average) read lengths (in bp) of your sequencing experiment, the K-mer size used to build the Kraken2 database, and Kraken2 available on your machine.
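The build step can be sketched as follows. This is a minimal sketch rather than a definitive recipe: `bracken_build_cmd` is a hypothetical helper that only prints the `bracken-build` command so you can review it before running it. The directory name `my_kraken2_db`, the k-mer length of 35 (the Kraken2 default), and the read length of 150 are placeholder assumptions that you should replace with your own values.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Bracken): prints the bracken-build command
# for a given Kraken2 database directory, k-mer length, and read length.
bracken_build_cmd() {
  local kraken_db="$1"  # directory containing the built Kraken2 database
  local kmer_len="$2"   # k-mer length used when building the Kraken2 database
  local read_len="$3"   # (average) read length of your sequencing run, in bp
  printf 'bracken-build -d %s -k %s -l %s\n' "$kraken_db" "$kmer_len" "$read_len"
}

# Placeholder values (assumptions): review the printed command before running it.
bracken_build_cmd my_kraken2_db 35 150
```

With a read length of 150, the build is expected to add `database150mers.kmer_distrib` and `database150mers.kraken` to the Kraken2 database directory.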
> 🛈 You can speed up database construction by supplying the threads parameter (`-t`).
> 🛈 If you do not have Kraken2 in your `$PATH` you can point to the binary with `-x /<path>/<to>/kraken2`.
You can follow the Bracken [tutorial](https://ccb.jhu.edu/software/bracken/index.shtml?t=manual) for more information. Alternatively, you can use one of the [pre-built indexes](https://benlangmead.github.io/aws-indexes/k2).
#### Centrifuge
<details markdown="1">
<summary>Output files</summary>

- `centrifuge`
  - `<database_name>.<number>.cf`
  - `<database_name>.<number>.cf`
  - `<database_name>.<number>.cf`
  - `<database_name>.<number>.cf`

</details>
Centrifuge allows the user to [build custom databases](https://ccb.jhu.edu/software/centrifuge/manual.shtml#custom-database). The user should download the taxonomy files, create a custom `seqid2taxid.map`, and combine the FASTA files together.
```bash
centrifuge-download -o taxonomy taxonomy
```

```text
## custom seqid2taxid.map
NC_001133.9 4392
NC_012920.1 9606
NC_001134.8 4392
NC_001135.5 4392
```

```bash
cat *.{fa,fna} > input-sequences.fna
centrifuge-build -p 4 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna taxprofiler_cf
```
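Before running `centrifuge-build`, it can help to confirm that every sequence in the combined FASTA file has an entry in `seqid2taxid.map`. The following is a minimal sketch under stated assumptions: `missing_seqids` is a hypothetical helper (not part of Centrifuge), and it assumes the sequence ID is the FASTA header text up to the first whitespace and that the ID sits in the first column of the map file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list sequence IDs that are present in a FASTA file
# but absent from the first column of seqid2taxid.map.
missing_seqids() {
  local fasta="$1" map="$2"
  awk '
    FILENAME == ARGV[1] { known[$1] = 1; next }  # map file: remember mapped IDs
    /^>/ {
      id = substr($1, 2)                         # header up to first whitespace, without ">"
      if (!(id in known)) print id               # report IDs with no taxid entry
    }
  ' "$map" "$fasta"
}

# Runs only if the files from the commands above exist in the current directory.
if [ -f input-sequences.fna ] && [ -f seqid2taxid.map ]; then
  missing_seqids input-sequences.fna seqid2taxid.map
fi
```

An empty result means every sequence has a mapping. Any ID that is printed needs a line in `seqid2taxid.map`.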
#### DIAMOND

<details markdown="1">
<summary>Output files</summary>

- `diamond`
  - `<database_name>.dmnd`

</details>
To create a custom database for DIAMOND, the user should download and unzip the NCBI taxonomy files, and then run `diamond makedb`. A detailed description can be found in the [DIAMOND tutorial](https://github.com/bbuchfink/diamond/wiki/1.-Tutorial).
```bash
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip

## warning: large file!
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz

## warning: takes a long time!
cat ../raw/*.faa | diamond makedb -d testdb-diamond --taxonmap prot.accession2taxid.FULL.gz --taxonnodes nodes.dmp --taxonnames names.dmp

## clean up the downloaded taxonomy files
rm *dmp *txt *gz *prt *zip
```
#### Kaiju
<details markdown="1">
<summary>Output files</summary>

- `kaiju`
  - `kaiju_db_*.fmi`
  - `nodes.dmp`
  - `names.dmp`

</details>
It is possible to [create custom databases](https://github.com/bioinformatics-centre/kaiju#custom-database) with Kaiju.
```bash
kaiju-mkbwt -n 5 -a ACDEFGHIKLMNPQRSTVWY -o proteins proteins.faa
kaiju-mkfmi proteins
```
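The Kaiju documentation describes the input for a custom database as a protein FASTA file whose headers are numeric NCBI taxon identifiers, optionally prefixed by another identifier and an underscore (for example `>1_1358`). The sketch below checks for that before indexing. `invalid_kaiju_headers` is a hypothetical helper (not part of Kaiju), and the pattern it uses is an assumption based on that description.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print FASTA headers that do not look like a numeric
# NCBI taxon ID, optionally prefixed by another identifier and an underscore.
invalid_kaiju_headers() {
  local faa="$1"
  # Keep header lines only, then drop those that match the expected pattern.
  # "|| true" keeps the exit status at zero when nothing is left to print.
  grep '^>' "$faa" | grep -Ev '^>([^[:space:]]+_)?[0-9]+$' || true
}

# Runs only if the input file from the commands above exists.
if [ -f proteins.faa ]; then
  invalid_kaiju_headers proteins.faa
fi
```

Any header that is printed would need to be rewritten before running `kaiju-mkbwt`.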
#### Kraken2
<details markdown="1">
<summary>Output files</summary>

- `kraken2`
  - `opts.k2d`
  - `hash.k2d`
  - `taxo.k2d`

</details>
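To confirm that a database directory contains the expected files, a small check such as the following can be used. This is a minimal sketch: `check_db_files` is a hypothetical helper, and the `kraken2` directory name is the placeholder from the list above, so adjust both the directory and the file names to your own database.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which expected files are missing from a
# database directory. Returns non-zero if at least one file is missing.
check_db_files() {
  local db_dir="$1"; shift
  local status=0 f
  for f in "$@"; do
    if [ ! -e "$db_dir/$f" ]; then
      echo "missing: $db_dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}

# Example with the Kraken2 file names; runs only if the directory exists.
if [ -d kraken2 ]; then
  check_db_files kraken2 opts.k2d hash.k2d taxo.k2d && echo "kraken2 database looks complete"
fi
```

The same helper can be reused for the other tools by passing the file names from their respective lists.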
> These instructions are based on Kraken 2.1.2.
> To build a Kraken2 database you need two components: a taxonomy (consisting of the `names.dmp`, `nodes.dmp`, and `*accession2taxid` files), and the FASTA files you wish to include.
> To pull the NCBI taxonomy you can run the following:
@ -524,34 +593,41 @@ You can then add the <YOUR_DB_NAME>/ path to your nf-core/taxprofiler database i
You can follow the Kraken2 [tutorial](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown#custom-databases) for a more detailed description.
#### KrakenUniq

<details markdown="1">
<summary>Output files</summary>

- `krakenuniq`
  - `opts.k2d`
  - `hash.k2d`
  - `taxo.k2d`
  - `database.idx`
  - `taxDB`

</details>

For KrakenUniq, we recommend using one of the [available databases](https://benlangmead.github.io/aws-indexes/k2). If you wish to build your own, please see the [documentation](https://github.com/fbreitwieser/krakenuniq/blob/master/README.md#custom-databases-with-ncbi-taxonomy).
#### MALT
<details markdown="1">
<summary>Output files</summary>

- `malt`
  - `ref.idx`
  - `taxonomy.idx`
  - `taxonomy.map`
  - `index0.idx`
  - `table0.idx`
  - `table0.db`
  - `ref.inf`
  - `ref.db`
  - `taxonomy.tre`

</details>
MALT does not provide any default databases for profiling, therefore you must build your own.
You need FASTA files to include, and an (unzipped) [MEGAN mapping 'db' file](https://software-ab.informatik.uni-tuebingen.de/download/megan6/) for your FASTA type.
In addition to the input directory, output directory, and the mapping file database, you also need to specify the sequence type (DNA or Protein) with the `-s` flag.
@ -568,42 +644,25 @@ MALT-build can be multi-threaded with `-t` to speed up building.
See the [MALT manual](https://software-ab.informatik.uni-tuebingen.de/download/malt/manual.pdf) for more information.
#### MetaPhlAn3

<details markdown="1">
<summary>Output files</summary>

- `metaphlan3`
  - `mpa_v30_CHOCOPhlAn_201901.pkl`
  - `mpa_v30_CHOCOPhlAn_201901.fasta`
  - `mpa_v30_CHOCOPhlAn_201901.1.bt2`
  - `mpa_v30_CHOCOPhlAn_201901.2.bt2`
  - `mpa_v30_CHOCOPhlAn_201901.3.bt2`
  - `mpa_v30_CHOCOPhlAn_201901.4.bt2`
  - `mpa_v30_CHOCOPhlAn_201901.rev.1.bt2`
  - `mpa_v30_CHOCOPhlAn_201901.rev.2.bt2`
  - `mpa_latest`

</details>
#### mOTUs