Converge test data usage (#249)
* initial data restructuing
* fixed bedtools_complement
* fixed bedtools_genomecov
* fixed bedtools_getfasta
* fixed bedtools_intersect
* fixed bedtools maskfasta
* fixed bedtools_merge
* fixed bedtools_slop
* fixed bedtools_sort
* fixed bismark_genome_preparation
* fixed blast
* fixed bowtie data
* fixed bowtie2 data
* fixed bwa data
* fixed bwamem2 data usage
* fixed cat_fastq data
* fixed cutadapt data
* fixed dsh data
* fixed fastp data
* fixed fastqc; fixed bug with wrong fastq format
* fixed gatk
* fixed data for gffread, gunzip
* fixed ivar paths
* fixed data paths for minimap2
* fixed mosdepth
* fixed multiqc, pangolin
* fixed picard data paths
* fixed data paths for qualimap, quast
* fixed salmon data paths
* fixed samtools paths
* fixed seqwish, stringtie paths
* fixed tabix, trimgalore paths
* cleaned up data
* added first description to README
* changed test data naming again; everything up to bwa fixed
* everything up to gatk4
* fixed everything up to ivar
* fixed everything up to picard
* everything up to quast
* everything fixed up to stringtie
* switched everyting to 'test' naming scheme
* fixed samtools and ivar tests
* cleaned up README a bit
* add (simulated) methylation test data
based on SARS-CoV-2 genome; simulated with Sherman --non_dir --genome sarscov2/fasta/ --paired -n 10000 -l 100 --CG 20 --CH 90
* bwameth/align: update data paths and checksums
also, build index on the go
* bwameth/index: update data paths and checksums
* methyldackel/extract: update data paths and checksums
* methyldackel/mbias: update data paths and checksums
* bismark/deduplicate: update data paths and checksums
* remove obsolete testdata
* remove empty 'dummy_file.txt'
* update data/README.md
* methyldackel: fix test
* Revert "methyldackel: fix test"
This reverts commit f175a32d144b1b0bfa0c6885da80c51e3cfe038a.
* methyldackel: fix test
for real
* move test.genome.sizes
* changed test names
* switched genomic to genome and transcriptome
* fix bedtools, blast
* fix gtf, tabix, .paf
* fix bowtie,bwa,bwameth
* fixed: bwa, bwamem, gatk, gffread, quast
* fixed bismark and blast
* fixed remaining tests
* delete bam file
Co-authored-by: phue <patrick.huether@gmail.com>
2021-03-04 10:10:57 +00:00
# Modules Test Data
This directory contains all data used for the individual module tests. It is currently organised into `genomics` and `generic`. The former contains the typical data required for genomics modules, such as fasta, fastq and bam files. Every folder in `genomics` corresponds to a single organism. Any other data is stored in `generic`. This contains files that currently cannot be assigned to a genomics category, as well as deprecated files that will be removed in the future and replaced by files in `genomics`.
When adding a new module, please check carefully whether the data necessary for the tests already exists in `tests/data/genomics`. If you can't find the data, please ask about it in the #modules channel on Slack.
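
One way to run the check described above is to list the files that already exist for a given file type. The following is only a minimal sketch: the helper name `find_test_data` is hypothetical and not part of this repository, and it assumes the script is run from the repository root so that `tests/data/genomics` resolves.

```python
from pathlib import Path


def find_test_data(root, suffix):
    """Return sorted paths (relative to root) of files whose name ends with suffix."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search if the data directory is missing.
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.name.endswith(suffix)
    )


if __name__ == "__main__":
    # Assumed location, taken from the paragraph above.
    for path in find_test_data("tests/data/genomics", ".fasta"):
        print(path)
```

Changing the suffix (for example to `.bam` or `.vcf.gz`) lists the other file types in the same way.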
## Data Description
### genomics
* sarscov2
    * bam
        * 'test_{,methylated_}paired_end.bam': sarscov2 sequencing reads aligned against 'test_genomic.fasta' using minimap2
        * 'test_{,methylated_}paired_end.sorted.bam': sorted version of the above bam file
        * 'test_{,methylated_}paired_end.bam.sorted.bam.bai': bam index for the sorted bam file
        * 'test_single_end.bam': alignment (unsorted) of the 'test_1.fastq.gz' reads against 'test_genomic.fasta' using minimap2
    * bed
        * 'test.bed': exemplary bed file for the MT192765.1 genome ('fasta/test_genomic.fasta')
        * 'test.2.bed': slightly modified copy of the above file
        * 'test.bed.gz': gzipped version
        * 'test.genome.sizes': genome size for the MT192765.1 genome
    * fasta
        * 'test_genomic.fasta': MT192765.1 genome (GCA_011545545.1_ASM1154554v1)
        * 'test_genomic.dict': GATK dict for 'test_genomic.fasta'
        * 'test_genomic.fasta.fai': fasta index for 'test_genomic.fasta'
        * 'test_cds_from_genomic.fasta': coding sequences from the MT192765.1 genome (transcripts)
    * fastq
        * 'test_{1,2}.fastq.gz': sarscov2 paired-end sequencing reads
        * 'test_{1,2}.2.fastq.gz': copies of the above reads
        * 'test_methylated_{1,2}.fastq.gz': sarscov2 paired-end bisulfite sequencing reads (generated with [Sherman](https://github.com/FelixKrueger/Sherman))
    * gtf
        * 'test_genomic.gtf': GTF for the MT192765.1 genome
        * 'test_genomic.gff3': GFF for the MT192765.1 genome
        * 'test_genomic.gff3.gz': bgzipped version
    * paf
        * 'test_cds_from_genomic.paf': PAF file for the MT192765.1 genome
2021-03-09 09:04:08 +00:00
    * vcf
        * 'test.vcf', 'test2.vcf': generated from 'test_paired_end.sorted.bam' using bcftools mpileup, call and filter
        * 'test3.vcf': generated from 'test_single_end.sorted.bam' using bcftools mpileup, call and filter
        * '*.gz': generated from the VCF files using bgzip
        * '.tbi': generated from the '.vcf.gz' files using `tabix -p vcf -f <file>`
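
Several entries in the list above use shell-style brace notation, such as 'test_{1,2}.fastq.gz', to describe more than one file in a single line. As a minimal sketch of how such a pattern maps to concrete file names, the hypothetical helper `expand_braces` below (not part of this repository) expands comma-separated brace groups. It assumes the groups are not nested.

```python
from __future__ import annotations

import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand non-nested {a,b} groups in a pattern into concrete names."""
    # re.split with a capturing group keeps the brace contents at odd indices
    # and the literal text at even indices.
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        part.split(",") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    # Combine one alternative from each group, in left-to-right order.
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for name in expand_braces("test_methylated_{1,2}.fastq.gz"):
        print(name)
```

An empty alternative, as in `{,x}`, expands to the name both without and with `x`.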
### generic
* 'a.gff3.gz': bgzipped gff3 file currently necessary for TABIX test
* bedgraph: bedgraph files for seacr
* fasta: additional fasta file currently necessary for STAR
* fastq: additional fastq files currently necessary for STAR
* gtf: additional gtf file for STAR
* vcf: several VCF files for tools that use them; these will be removed in the future
* 'test.txt.gar.gz': exemplary tar file for the untar module