From 4ec8b025bdb436e138230ceda06bcff94585dc01 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?S=C3=A9bastien=20Guizard?=
Date: Mon, 27 Sep 2021 16:14:35 +0100
Subject: [PATCH] New module: `LIMA` (#719)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* 📦 NEW: Add module lima
* 👌 IMPROVE: Move .pbi output to reports channel
* 🐛 FIX: Fix report channel definition
* 👌IMPROVE; Remove options from command line
  update test script with removed options
* 👌 IMPROVE: Add some pacbio test files
* 🐛 FIX: Add Pacbio index to test_data.config
* 👌 IMPROVE: Re add 10000 data test
* 🐛 FIX: Add pbi input
* 👌 IMPROVE: Add parallelization to lima
* 👌 IMPROVE: Add some pbindex
* 🐛 FIX: Add pbi extension to files
* 👌 IMPROVE: The accept one channel (primers move into the first channel)
* 👌 IMPROVE: Assign a value channel for pimers
  Improve code workflow readability
* 👌 IMPROVE: Update .gitignore
* 👌 IMPROVE: Update module to last template version
* 🐛 FIX: Correct Singularity and Docker URL
* 👌 IMPROVE: Update to the last version of modules template
* 👌 IMPROVE: Update test_data.config
* 👌 IMPROVE: Remove pbi from input files
* 👌 IMPROVE: Final version of test datasets config
* 👌 IMPROVE: Remove useless index + Fix Typos
* 🐛 FIX: Fill contains args
* 📦 NEW: Add module lima
* 👌 IMPROVE: Move .pbi output to reports channel
* 🐛 FIX: Fix report channel definition
* 👌IMPROVE; Remove options from command line
  update test script with removed options
* 🐛 FIX: Add pbi input
* 👌 IMPROVE: Add parallelization to lima
* 👌 IMPROVE: Add some pacbio test files
* 🐛 FIX: Add Pacbio index to test_data.config
* 👌 IMPROVE: Re add 10000 data test
* 👌 IMPROVE: Add some pbindex
* 🐛 FIX: Add pbi extension to files
* 👌 IMPROVE: The accept one channel (primers move into the first channel)
* 👌 IMPROVE: Assign a value channel for pimers
  Improve code workflow readability
* 👌 IMPROVE: Update .gitignore
* 👌 IMPROVE: Update module to last template version
* 🐛 FIX: Correct Singularity and Docker URL
* 👌 IMPROVE: Update to the last version of modules template
* 👌 IMPROVE: Update test_data.config
* 👌 IMPROVE: Remove pbi from input files
* 👌 IMPROVE: Final version of test datasets config
* 👌 IMPROVE: Remove useless index + Fix Typos
* 🐛 FIX: Fill contains args
* 👌 IMPROVE: Add channel for each output
* 👌 IMPROVE: Remove comments
* 📦 NEW: Add module lima
* 👌 IMPROVE: Move .pbi output to reports channel
* 🐛 FIX: Fix report channel definition
* 👌IMPROVE; Remove options from command line
  update test script with removed options
* 🐛 FIX: Add pbi input
* 👌 IMPROVE: Add parallelization to lima
* 👌 IMPROVE: Add some pacbio test files
* 🐛 FIX: Add Pacbio index to test_data.config
* 👌 IMPROVE: Re add 10000 data test
* 👌 IMPROVE: Add some pbindex
* 🐛 FIX: Add pbi extension to files
* 👌 IMPROVE: The accept one channel (primers move into the first channel)
* 👌 IMPROVE: Assign a value channel for pimers
  Improve code workflow readability
* 👌 IMPROVE: Update module to last template version
* 🐛 FIX: Correct Singularity and Docker URL
* 👌 IMPROVE: Update to the last version of modules template
* 👌 IMPROVE: Update test_data.config
* 👌 IMPROVE: Remove pbi from input files
* 🐛 FIX: Fill contains args
* 📦 NEW: Add module lima
* 👌 IMPROVE: Move .pbi output to reports channel
* 🐛 FIX: Fix report channel definition
* 👌IMPROVE; Remove options from command line
  update test script with removed options
* 🐛 FIX: Add pbi input
* 👌 IMPROVE: Add parallelization to lima
* 👌 IMPROVE: Add some pacbio test files
* 🐛 FIX: Add Pacbio index to test_data.config
* 👌 IMPROVE: Re add 10000 data test
* 👌 IMPROVE: Add some pbindex
* 🐛 FIX: Add pbi extension to files
* 👌 IMPROVE: The accept one channel (primers move into the first channel)
* 👌 IMPROVE: Assign a value channel for pimers
  Improve code workflow readability
* 👌 IMPROVE: Update module to last template version
* 🐛 FIX: Correct Singularity and Docker URL
* 👌 IMPROVE: Update to the last version of modules template
* 👌 IMPROVE: Update test_data.config
* 👌 IMPROVE: Remove pbi from input files
* 👌 IMPROVE: Final version of test datasets config
* 👌 IMPROVE: Remove useless index + Fix Typos
* 🐛 FIX: Fill contains args
* 👌 IMPROVE: Add channel for each output
* 👌 IMPROVE: Remove comments
* 🐛 FIX: Clean test_data.config
* Update modules/lima/main.nf
  Add meta to each output
  Co-authored-by: Harshil Patel
* Update modules/lima/main.nf
  Remove useless parenthesis
  Co-authored-by: Harshil Patel
* 🐛 FIX: Keep version number only
* 🐛 FIX: Reintegrate prefix variable and use it to define output file name
* 👌 IMPROVE: add suffix arg to check output files names
* 👌 IMPROVE: Use prefix for output filename
* 🐛 FIX: Set optional output
  Allow usage of different input formats
* 👌 IMPROVE: Update meta file
* 👌 IMPROVE: Update test
  One test for each input file type
* 👌 IMPROVE: add fasta, fastq.gz, fastq, fastq.gz test files
* 👌 IMPROVE: Update with last templates / Follow new version.yaml rule
* 🐛 FIX: Fix typos and include getProcessName function
* 👌 IMPROVE: Update .gitignore
* 👌 IMPROVE: Using suffix to manage output was not a my best idea
  Add a bash code to detect extension and update output file name
* 👌 IMPROVE: clean code

Co-authored-by: Harshil Patel
Co-authored-by: Gregor Sturm
Co-authored-by: Mahesh Binzer-Panchal
---
 .gitignore | 2 +
 modules/lima/functions.nf | 78 ++++++++++++++++++++++++++++
 modules/lima/main.nf | 73 ++++++++++++++++++++++++++
 modules/lima/meta.yml | 77 ++++++++++++++++++++++++++++
 tests/config/pytest_modules.yml | 4 ++
 tests/config/test_data.config | 5 ++
 tests/modules/lima/main.nf | 60 ++++++++++++++++++++++
 tests/modules/lima/test.yml | 91 +++++++++++++++++++++++++++++++++
 8 files 
changed, 390 insertions(+) create mode 100644 modules/lima/functions.nf create mode 100644 modules/lima/main.nf create mode 100644 modules/lima/meta.yml create mode 100644 tests/modules/lima/main.nf create mode 100644 tests/modules/lima/test.yml diff --git a/.gitignore b/.gitignore index 9d982e3f..06eae014 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,5 @@ __pycache__ *.pyo *.pyc tests/data/ +modules/modtest/ +tests/modules/modtest/ diff --git a/modules/lima/functions.nf b/modules/lima/functions.nf new file mode 100644 index 00000000..85628ee0 --- /dev/null +++ b/modules/lima/functions.nf @@ -0,0 +1,78 @@ +// +// Utility functions used in nf-core DSL2 module files +// + +// +// Extract name of software tool from process name using $task.process +// +def getSoftwareName(task_process) { + return task_process.tokenize(':')[-1].tokenize('_')[0].toLowerCase() +} + +// +// Extract name of module from process name using $task.process +// +def getProcessName(task_process) { + return task_process.tokenize(':')[-1] +} + +// +// Function to initialise default values and to generate a Groovy Map of available options for nf-core modules +// +def initOptions(Map args) { + def Map options = [:] + options.args = args.args ?: '' + options.args2 = args.args2 ?: '' + options.args3 = args.args3 ?: '' + options.publish_by_meta = args.publish_by_meta ?: [] + options.publish_dir = args.publish_dir ?: '' + options.publish_files = args.publish_files + options.suffix = args.suffix ?: '' + return options +} + +// +// Tidy up and join elements of a list to return a path string +// +def getPathFromList(path_list) { + def paths = path_list.findAll { item -> !item?.trim().isEmpty() } // Remove empty entries + paths = paths.collect { it.trim().replaceAll("^[/]+|[/]+\$", "") } // Trim whitespace and trailing slashes + return paths.join('/') +} + +// +// Function to save/publish module results +// +def saveFiles(Map args) { + def ioptions = initOptions(args.options) + def path_list = [ 
ioptions.publish_dir ?: args.publish_dir ] + + // Do not publish versions.yml unless running from pytest workflow + if (args.filename.equals('versions.yml') && !System.getenv("NF_CORE_MODULES_TEST")) { + return null + } + if (ioptions.publish_by_meta) { + def key_list = ioptions.publish_by_meta instanceof List ? ioptions.publish_by_meta : args.publish_by_meta + for (key in key_list) { + if (args.meta && key instanceof String) { + def path = key + if (args.meta.containsKey(key)) { + path = args.meta[key] instanceof Boolean ? "${key}_${args.meta[key]}".toString() : args.meta[key] + } + path = path instanceof String ? path : '' + path_list.add(path) + } + } + } + if (ioptions.publish_files instanceof Map) { + for (ext in ioptions.publish_files) { + if (args.filename.endsWith(ext.key)) { + def ext_list = path_list.collect() + ext_list.add(ext.value) + return "${getPathFromList(ext_list)}/$args.filename" + } + } + } else if (ioptions.publish_files == null) { + return "${getPathFromList(path_list)}/$args.filename" + } +} diff --git a/modules/lima/main.nf b/modules/lima/main.nf new file mode 100644 index 00000000..1ff5ac48 --- /dev/null +++ b/modules/lima/main.nf @@ -0,0 +1,73 @@ +// Import generic module functions +include { initOptions; saveFiles; getSoftwareName; getProcessName } from './functions' + +params.options = [:] +options = initOptions(params.options) + +process LIMA { + tag "$meta.id" + label 'process_medium' + publishDir "${params.outdir}", + mode: params.publish_dir_mode, + saveAs: { filename -> saveFiles(filename:filename, options:params.options, publish_dir:getSoftwareName(task.process), meta:meta, publish_by_meta:['id']) } + + conda (params.enable_conda ? 
"bioconda::lima=2.2.0" : null) + if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) { + container "https://depot.galaxyproject.org/singularity/lima:2.2.0--h9ee0642_0" + } else { + container "quay.io/biocontainers/lima:2.2.0--h9ee0642_0" + } + + input: + tuple val(meta), path(ccs) + path primers + + output: + tuple val(meta), path("*.clips") , emit: clips + tuple val(meta), path("*.counts") , emit: counts + tuple val(meta), path("*.guess") , emit: guess + tuple val(meta), path("*.report") , emit: report + tuple val(meta), path("*.summary"), emit: summary + path "versions.yml" , emit: version + + tuple val(meta), path("*.bam") , optional: true, emit: bam + tuple val(meta), path("*.bam.pbi") , optional: true, emit: pbi + tuple val(meta), path("*.{fa,fasta}") , optional: true, emit: fasta + tuple val(meta), path("*.{fa.gz,fasta.gz}"), optional: true, emit: fastagz + tuple val(meta), path("*.fastq") , optional: true, emit: fastq + tuple val(meta), path("*.fastq.gz") , optional: true, emit: fastqgz + tuple val(meta), path("*.xml") , optional: true, emit: xml + tuple val(meta), path("*.json") , optional: true, emit: json + + script: + def prefix = options.suffix ? 
"${meta.id}${options.suffix}" : "${meta.id}" + + """ + OUT_EXT="" + + if [[ $ccs =~ bam\$ ]]; then + OUT_EXT="bam" + elif [[ $ccs =~ fasta\$ ]]; then + OUT_EXT="fasta" + elif [[ $ccs =~ fasta.gz\$ ]]; then + OUT_EXT="fasta.gz" + elif [[ $ccs =~ fastq\$ ]]; then + OUT_EXT="fastq" + elif [[ $ccs =~ fastq.gz\$ ]]; then + OUT_EXT="fastq.gz" + fi + + echo \$OUT_EXT + lima \\ + $ccs \\ + $primers \\ + $prefix.\$OUT_EXT \\ + -j $task.cpus \\ + $options.args + + cat <<-END_VERSIONS > versions.yml + ${getProcessName(task.process)}: + lima: \$( lima --version | sed 's/lima //g' | sed 's/ (.\\+//g' ) + END_VERSIONS + """ +} diff --git a/modules/lima/meta.yml b/modules/lima/meta.yml new file mode 100644 index 00000000..3bb861b5 --- /dev/null +++ b/modules/lima/meta.yml @@ -0,0 +1,77 @@ +name: lima +description: lima - The PacBio Barcode Demultiplexer and Primer Remover +keywords: + - sort +tools: + - lima: + description: lima - The PacBio Barcode Demultiplexer and Primer Remover + homepage: https://github.com/PacificBiosciences/pbbioconda + documentation: https://lima.how/ + tool_dev_url: https://github.com/pacificbiosciences/barcoding/ + doi: "" + licence: ['BSD-3-clause-Clear'] + +input: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. [ id:'test' ] + - ccs: + type: file + description: A BAM or fasta or fasta.gz or fastq or fastq.gz file of subreads or ccs + pattern: "*.{bam,fasta,fasta.gz,fastq,fastq.gz}" + - primers: + type: file + description: Fasta file, sequences of primers + pattern: "*.fasta" + +output: + - meta: + type: map + description: | + Groovy Map containing sample information + e.g. 
[ id:'test' ] + - bam: + type: file + description: A bam file of ccs purged of primers + pattern: "*.bam" + - pbi: + type: file + description: Pacbio index file of ccs purged of primers + pattern: "*.bam.pbi" + - xml: + type: file + description: An XML file representing a set of a particular sequence data type such as subreads, references or aligned subreads. + pattern: "*.xml" + - json: + type: file + description: A metadata json file + pattern: "*.json" + - clips: + type: file + description: A fasta file of clipped primers + pattern: "*.clips" + - counts: + type: file + description: A tabulated file describing pairs of primers + pattern: "*.counts" + - guess: + type: file + description: A second tabulated file describing pairs of primers (no documentation available) + pattern: "*.guess" + - report: + type: file + description: A tab-separated file about each ZMW, unfiltered + pattern: "*.report" + - summary: + type: file + description: This file shows how many ZMWs have been filtered, how many ZMWs are same/different, and how many reads have been filtered. 
+ pattern: "*.summary" + - version: + type: file + description: File containing software version + pattern: "versions.yml" + +authors: + - "@sguizard" diff --git a/tests/config/pytest_modules.yml b/tests/config/pytest_modules.yml index 74673511..16d4790d 100644 --- a/tests/config/pytest_modules.yml +++ b/tests/config/pytest_modules.yml @@ -562,6 +562,10 @@ last/train: - modules/last/train/** - tests/modules/last/train/** +lima: + - modules/lima/** + - tests/modules/lima/** + lofreq/call: - modules/lofreq/call/** - tests/modules/lofreq/call/** diff --git a/tests/config/test_data.config b/tests/config/test_data.config index eda747e0..8b246c7c 100644 --- a/tests/config/test_data.config +++ b/tests/config/test_data.config @@ -175,6 +175,11 @@ params { alz = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.bam" alzpbi = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.bam.pbi" ccs = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.ccs.bam" + ccs_fa = "${test_data_dir}/genomics/homo_sapiens/pacbio/fasta/alz.ccs.fasta" + ccs_fa_gz = "${test_data_dir}/genomics/homo_sapiens/pacbio/fasta/alz.ccs.fasta.gz" + ccs_fq = "${test_data_dir}/genomics/homo_sapiens/pacbio/fastq/alz.ccs.fastq" + ccs_fq_gz = "${test_data_dir}/genomics/homo_sapiens/pacbio/fastq/alz.ccs.fastq.gz" + ccs_xml = "${test_data_dir}/genomics/homo_sapiens/pacbio/xml/alz.ccs.consensusreadset.xml" lima = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.ccs.fl.NEB_5p--NEB_Clontech_3p.bam" refine = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.ccs.fl.NEB_5p--NEB_Clontech_3p.flnc.bam" cluster = "${test_data_dir}/genomics/homo_sapiens/pacbio/bam/alz.ccs.fl.NEB_5p--NEB_Clontech_3p.flnc.clustered.bam" diff --git a/tests/modules/lima/main.nf b/tests/modules/lima/main.nf new file mode 100644 index 00000000..df4b2be2 --- /dev/null +++ b/tests/modules/lima/main.nf @@ -0,0 +1,60 @@ +#!/usr/bin/env nextflow + +nextflow.enable.dsl = 2 + +include { LIMA } from '../../../modules/lima/main.nf' 
addParams( options: [args: '--isoseq --peek-guess', suffix: ".fl"] ) + +workflow test_lima_bam { + + input = [ + [ id:'test' ], // meta map + file(params.test_data['homo_sapiens']['pacbio']['ccs'], checkIfExists: true), + ] + primers = [ file(params.test_data['homo_sapiens']['pacbio']['primers'], checkIfExists: true) ] + + LIMA ( input, primers ) +} + +workflow test_lima_fa { + + input = [ + [ id:'test' ], // meta map + file(params.test_data['homo_sapiens']['pacbio']['ccs_fa'], checkIfExists: true), + ] + primers = [ file(params.test_data['homo_sapiens']['pacbio']['primers'], checkIfExists: true) ] + + LIMA ( input, primers ) +} + +workflow test_lima_fa_gz { + + input = [ + [ id:'test' ], // meta map + file(params.test_data['homo_sapiens']['pacbio']['ccs_fa_gz'], checkIfExists: true), + ] + primers = [ file(params.test_data['homo_sapiens']['pacbio']['primers'], checkIfExists: true) ] + + LIMA ( input, primers ) +} + +workflow test_lima_fq { + + input = [ + [ id:'test' ], // meta map + file(params.test_data['homo_sapiens']['pacbio']['ccs_fq'], checkIfExists: true), + ] + primers = [ file(params.test_data['homo_sapiens']['pacbio']['primers'], checkIfExists: true) ] + + LIMA ( input, primers ) +} + +workflow test_lima_fq_gz { + + input = [ + [ id:'test' ], // meta map + file(params.test_data['homo_sapiens']['pacbio']['ccs_fq_gz'], checkIfExists: true), + ] + primers = [ file(params.test_data['homo_sapiens']['pacbio']['primers'], checkIfExists: true) ] + + LIMA ( input, primers ) +} diff --git a/tests/modules/lima/test.yml b/tests/modules/lima/test.yml new file mode 100644 index 00000000..1ff860d9 --- /dev/null +++ b/tests/modules/lima/test.yml @@ -0,0 +1,91 @@ +- name: lima test_lima_bam + command: nextflow run tests/modules/lima -entry test_lima_bam -c tests/config/nextflow.config + tags: + - lima + files: + - path: output/lima/test.fl.NEB_5p--NEB_Clontech_3p.bam + md5sum: 14b51d7f44e30c05a5b14e431a992097 + - path: output/lima/test.fl.NEB_5p--NEB_Clontech_3p.bam.pbi 
+ md5sum: 6ae7f057304ad17dd9d5f565d72d3f7b + - path: output/lima/test.fl.NEB_5p--NEB_Clontech_3p.consensusreadset.xml + contains: [ 'ConsensusReadSet' ] + - path: output/lima/test.fl.json + contains: [ 'ConsensusReadSet' ] + - path: output/lima/test.fl.lima.clips + md5sum: fa03bc75bd78b2648a139fd67c69208f + - path: output/lima/test.fl.lima.counts + md5sum: 842c6a23ca2de504ced4538ad5111da1 + - path: output/lima/test.fl.lima.guess + md5sum: d3675af3ca8a908ee9e3c231668392d3 + - path: output/lima/test.fl.lima.report + md5sum: dc073985322ae0a003ccc7e0fa4db5e6 + - path: output/lima/test.fl.lima.summary + md5sum: bcbcaaaca418bdeb91141c81715ca420 + +- name: lima test_lima_fa + command: nextflow run tests/modules/lima -entry test_lima_fa -c tests/config/nextflow.config + tags: + - lima + files: + - path: output/lima/test.fl.lima.clips + md5sum: 1012bc8874a14836f291bac48e8482a4 + - path: output/lima/test.fl.lima.counts + md5sum: a4ceaa408be334eaa711577e95f8730e + - path: output/lima/test.fl.lima.guess + md5sum: 651e5f2b438b8ceadb3e06a2177e1818 + - path: output/lima/test.fl.lima.report + md5sum: bd4a8bde17471563cf91aab4c787911d + - path: output/lima/test.fl.lima.summary + md5sum: 03be2311ba4afb878d8e547ab38c11eb + +- name: lima test_lima_fa_gz + command: nextflow run tests/modules/lima -entry test_lima_fa_gz -c tests/config/nextflow.config + tags: + - lima + files: + - path: output/lima/test.fl.lima.clips + md5sum: 1012bc8874a14836f291bac48e8482a4 + - path: output/lima/test.fl.lima.counts + md5sum: a4ceaa408be334eaa711577e95f8730e + - path: output/lima/test.fl.lima.guess + md5sum: 651e5f2b438b8ceadb3e06a2177e1818 + - path: output/lima/test.fl.lima.report + md5sum: bd4a8bde17471563cf91aab4c787911d + - path: output/lima/test.fl.lima.summary + md5sum: 03be2311ba4afb878d8e547ab38c11eb + +- name: lima test_lima_fq + command: nextflow run tests/modules/lima -entry test_lima_fq -c tests/config/nextflow.config + tags: + - lima + files: + - path: 
output/lima/test.fl.NEB_5p--NEB_Clontech_3p.fastq + md5sum: ef395f689c5566f501e300bb83d7a5f2 + - path: output/lima/test.fl.lima.clips + md5sum: 5c16ef8122f6f1798acc30eb8a30828c + - path: output/lima/test.fl.lima.counts + md5sum: 767b687e6eda7b24cd0e577f527eb2f0 + - path: output/lima/test.fl.lima.guess + md5sum: 31b988aab6bda84867e704b9edd8a763 + - path: output/lima/test.fl.lima.report + md5sum: ad2a9b1eeb4cda4a1f69ef4b7520b5fd + - path: output/lima/test.fl.lima.summary + md5sum: e91d3c386aaf4effa63f33ee2eb7da2a + +- name: lima test_lima_fq_gz + command: nextflow run tests/modules/lima -entry test_lima_fq_gz -c tests/config/nextflow.config + tags: + - lima + files: + - path: output/lima/test.fl.NEB_5p--NEB_Clontech_3p.fastq.gz + md5sum: 32c11db85f69a1b4454b6bbd794b6df2 + - path: output/lima/test.fl.lima.clips + md5sum: 5c16ef8122f6f1798acc30eb8a30828c + - path: output/lima/test.fl.lima.counts + md5sum: 767b687e6eda7b24cd0e577f527eb2f0 + - path: output/lima/test.fl.lima.guess + md5sum: 31b988aab6bda84867e704b9edd8a763 + - path: output/lima/test.fl.lima.report + md5sum: ad2a9b1eeb4cda4a1f69ef4b7520b5fd + - path: output/lima/test.fl.lima.summary + md5sum: e91d3c386aaf4effa63f33ee2eb7da2a