Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
version: 1.2
workflows:
- name: main
subclass: Galaxy
publish: true
primaryDescriptorPath: /host-or-contamination-removal-on-long-reads.ga
testParameterFiles:
- /host-or-contamination-removal-on-long-reads-tests.yml
authors:
- name: Paul Zierep
orcid: 0000-0003-2982-388X
- name: "B\xE9r\xE9nice Batut"
orcid: 0000-0001-9852-1987
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
# Changelog

## [0.1] yyyy-mm-dd

First release.
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Host or Contamination removal on long-reads

The extraction of microbiome DNA or RNA is usually contaminated by host and human DNA or RNA (but also other contaminant). It is an important to get rid of all host/contamination sequences and to only retain microbiome sequences, both in order to speed up further steps and to avoid host/contamination sequences compromising the analysis.

This workflow takes Nanopore fastq(.gz) files and executes the following steps:
1. Mapping of the reads against a reference genome of the host or contaminant (e.g. human) using **Minimap 2**,
2. Filtering of the generated BAM using **BAMtools** and **Samtools** to keep only the reads that do not align,
3. Generation of mapping statistics using **QualiMap**,
2. Aggregation of the mapping statistics using **MultiQC**

## Input Datasets

- A list of datasets corresponding to reads in `fastqsanger` or `fastqsanger.gz` format.
- Reference genome
- Profile for mapping

## Output Datasets

- A list of datasets corresponding to unmapped reads in `fastqsanger` or `fastqsanger.gz`.
- A list of reports of QualiMap for each sample that could be used as inputs for extra MultiQC
- MultiQC report of the mapping statistics in HTML
Original file line number Diff line number Diff line change
@@ -0,0 +1,183 @@
- doc: Test outline for host-or-contamination-removal-on-long-reads
job:
Long-reads:
class: Collection
collection_type: list
elements:
- class: File
identifier: Spike3bBarcode10
location: https://zenodo.org/record/12190648/files/collection_of_all_samples_Spike3bBarcode10.fastq.gz
filetype: fastqsanger.gz
- class: File
identifier: Spike3bBarcode12
location: https://zenodo.org/record/12190648/files/collection_of_all_samples_Spike3bBarcode12.fastq.gz
filetype: fastqsanger.gz
Host/Contaminant Reference Genome (long-reads): hg38
Profile of preset options for the mapping (long-read): map-pb
outputs:
qualimap_stats:
element_tests:
Spike3bBarcode10:
elements:
genome_results:
asserts:
has_text:
text: "Spike3bBarcode10"
has_text:
text: "3,209,286,105 bp"
coverage_across_reference:
asserts:
has_text:
text: "#Position (bp)"
has_n_lines:
value: 854
coverage_histogram:
asserts:
has_text:
text: "Number of genomic locations"
has_n_lines:
value: 7
genome_fraction_coverage:
asserts:
has_text:
text: "#Coverage (X)"
has_n_lines:
value: 151
duplication_rate_histogram:
asserts:
has_text:
text: "#Duplication rate"
has_text:
text: "104.0"
homopolymer_indels:
asserts:
has_text:
text: "#Type of indel"
has_text:
text: "polyN"
insert_size_across_reference:
asserts:
has_size:
value: 0
insert_size_histogram:
asserts:
has_size:
value: 0
mapped_reads_clipping_profile:
asserts:
has_text:
text: "#Read position (bp)"
has_text:
text: "6.161988"
mapped_reads_gc-content_distribution:
asserts:
has_text:
text: "#GC Content (%)"
has_n_lines:
value: 100
mapped_reads_nucleotide_content:
asserts:
has_text:
text: "16.666666"
mapping_quality_across_reference:
asserts:
has_text:
text: "Filtered Reads"
has_n_lines:
value: 854
mapping_quality_histogram:
asserts:
has_text:
text: "#Mapping quality"
has_n_lines:
value: 41
Spike3bBarcode12:
elements:
genome_results:
asserts:
has_text:
text: "Spike3bBarcode12"
has_text:
text: "3,209,286,105 bp"
coverage_across_reference:
asserts:
has_text:
text: "#Position (bp)"
has_n_lines:
value: 100
coverage_histogram:
asserts:
has_text:
text: "Number of genomic locations"
has_n_lines:
value: 4
genome_fraction_coverage:
asserts:
has_text:
text: "#Coverage (X)"
has_n_lines:
value: 51
duplication_rate_histogram:
asserts:
has_text:
text: "#Duplication rate"
has_text:
text: "119.0"
homopolymer_indels:
asserts:
has_text:
text: "#Type of indel"
has_text:
text: "polyN"
insert_size_across_reference:
asserts:
has_size:
value: 0
insert_size_histogram:
asserts:
has_size:
value: 0
mapped_reads_clipping_profile:
asserts:
has_text:
text: "#Read position (bp)"
has_text:
text: "2.273913"
mapped_reads_gc-content_distribution:
asserts:
has_text:
text: "#GC Content (%)"
has_n_lines:
value: 100
mapped_reads_nucleotide_content:
asserts:
has_text:
text: "16.666666"
mapping_quality_across_reference:
asserts:
has_text:
text: "Filtered Reads"
has_n_lines:
value: 854
mapping_quality_histogram:
asserts:
has_text:
text: "#Mapping quality"
has_n_lines:
value: 37
multiqc_html_report:
asserts:
has_text:
text: "Spike3bBarcode10"
has_text:
text: "Spike3bBarcode12"
samtools_fastx:
element_tests:
Spike3bBarcode10:
asserts:
has_text:
text: "@0a0c4d2c-291f-46a4-87d5-625efbfed6a0"
Spike3bBarcode12:
asserts:
has_text:
text: "@0a0c4e88-893a-4284-9119-ab4274e05445"
Loading
Loading