Commit graph

2 commits

Author SHA1 Message Date
Charles Plessy
45f2f1ee5f
Input a triple (id, fasta, params) to last/lastal (#563)
The `last/lastal` submodule takes query sequences to align to a target
index, and optionally takes one set of alignment parameters (including a
score matrix) computed by the `last/train` module for each of the
sequences.

In the previous implementation the sequences and the alignment
parameters were provided in different channels, causing them to be
sometimes desynchronised.

In the patched implementation, `last/lastal` takes a 3-tuple as
input to ensure synchronicity.  To produce this tuple in a pipeline,
one can use the `join` command as in the following example.

     LAST_TRAIN  ( query,
                   target )
     LAST_LASTAL ( query.join(LAST_TRAIN.out.param_file),
                   target )

In case no parameter file is computed one can pass a dummy file
to the module as follows:

     LAST_LASTAL ( query.map { row -> [ row[0], row[1], [] ] },
                   target )
2021-07-06 09:35:04 +01:00
Charles Plessy
207930139a
New last/lastal module to align query sequences on a target index (#510)
* New last/lastal to align query sequences on a target index

`lastal` is the main program of the [LAST](https://gitlab.com/mcfrith/last)
suite.  It align query DNA sequences in FASTA or FASTQ format to a
target index of DNA or protein sequences.  The index is produced by
the `lastdb` program (module `last/lastdb`).  The score matrix for
evaluating the alignment can be chosen among preset ones or computed
iteratively by the `last-train` program (module `last/train`).  For
this reason, the `last/lastal` module proposed here has one input
channel containing an optional file, that has to be dummy when not used.

The LAST aligner outputs MAF files that can be very large (up to
hundreds of gigabytes), therefore this module unconditionally compresses
its output with gzip.

This new module is part of the work described in Issue #464.  During
this development, we fix the version of LAST to 1219 to ensure
consistency (hence ignore lint's version warning).

* Apply suggestions from code review

Co-authored-by: Harshil Patel <drpatelh@users.noreply.github.com>

* Un-hardcode the path to the LAST index.

Among multiple alternatives I have chosen the following command to
detect the sample name of the index, because it fails in situations
where there is no index files in the index folder, and in situations
were there are two indexes files in the folder.  Not failing would
result in feeding garbage information in the INDEX_NAME variable.

    basename \$(ls $index/*.bck) .bck

In case of missing file, a clear error message is given by `ls`.  In
case of more than one file, the error message of `basename` is more
cryptic, unfortunately.  (`basename: extra operand ‘.bck’`)

Alternatives that do not fail if there is no .bck file:

    basename $index/*bck .bck
    find $index -name '*bck' | sed 's/.bck//'

Alternatives that do not fail if there are more than one .bck file:

    basename -s .bck $index/*bck
    ls $index/*.bck | xargs basename -s .bck
    find $index -name '*bck' | sed 's/.bck//'

Co-authored-by: Harshil Patel <drpatelh@users.noreply.github.com>
2021-05-25 22:10:48 +01:00