Skip to content

Commit 8d02c9d

Browse files
authored
Merge pull request #18 from cokelaer/main
implement bwa_split to analyse large fastq files
2 parents 55c315b + 5ae8a48 commit 8d02c9d

File tree

9 files changed

+248
-81
lines changed

9 files changed

+248
-81
lines changed

.github/workflows/pypi.yml

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -11,20 +11,20 @@ jobs:
1111
runs-on: ubuntu-20.04
1212
steps:
1313
- uses: actions/checkout@main
14-
- name: Set up Python 3.7
15-
uses: actions/setup-python@v1
14+
- name: Set up Python 3.8
15+
uses: actions/setup-python@v2
1616
with:
17-
python-version: 3.7
17+
python-version: 3.8
1818

19-
- name: Install package
19+
- name: Install package
2020
run: |
21-
pip install build
21+
pip install build poetry
2222
2323
- name: Build source tarball
2424
run: |
2525
rm -rf dist;
26-
python setup.py sdist
27-
26+
poetry build
27+
2828
- name: Publish distribution to Test PyPI
2929
uses: pypa/gh-action-pypi-publish@release/v1
3030
with:

.pre-commit-config.yaml

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
2+
files: '\.(py|rst|sh)$'
3+
fail_fast: false
4+
5+
repos:
6+
- repo: https://github.com/pre-commit/pre-commit-hooks
7+
rev: v3.2.0
8+
hooks:
9+
- id: trailing-whitespace
10+
- id: end-of-file-fixer
11+
- id: check-yaml
12+
#- id: check-executables-have-shebangs
13+
- id: check-ast
14+
15+
- repo: https://github.com/pycqa/flake8
16+
rev: 6.1.0
17+
hooks:
18+
- id: flake8
19+
args: ["-j8", "--ignore=E203,E501,W503,E722", "--max-line-length=120", "--exit-zero"]
20+
21+
- repo: https://github.com/psf/black
22+
rev: 22.10.0
23+
hooks:
24+
- id: black
25+
args: ["--line-length=120"]
26+
exclude: E501
27+
28+
- repo: https://github.com/pycqa/isort
29+
rev: 5.12.0
30+
hooks:
31+
- id: isort
32+
args: ["--profile", "black"] # solves conflicts between black and isort
33+

README.rst

Lines changed: 17 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33
:target: https://pypi.python.org/pypi/sequana_mapper
44

55
.. image:: https://github.com/sequana/mapper/actions/workflows/main.yml/badge.svg
6-
:target: https://github.com/sequana/mapper/actions/
6+
:target: https://github.com/sequana/mapper/actions/
77

88
.. image:: https://img.shields.io/badge/python-3.8%20%7C%203.9%20%7C3.10-blue.svg
99
:target: https://pypi.python.org/pypi/sequana
@@ -46,7 +46,7 @@ to execute the pipeline::
4646
cd mapper
4747
sh mapper.sh # for a local run
4848

49-
This launch a snakemake pipeline. If you are familiar with snakemake, you can
49+
This launch a snakemake pipeline. If you are familiar with snakemake, you can
5050
retrieve the pipeline itself and its configuration files and then execute the pipeline yourself with specific parameters::
5151

5252
snakemake -s mapper.rules -c config.yaml --cores 4 \
@@ -74,21 +74,21 @@ This pipelines requires the following executable(s):
7474
Details
7575
~~~~~~~~~
7676

77-
This pipeline runs **mapper** in parallel on the input fastq files (paired or not).
78-
A brief sequana summary report is also produced. When using **--pacbio** option,
77+
This pipeline runs **mapper** in parallel on the input fastq files (paired or not).
78+
A brief sequana summary report is also produced. When using **--pacbio** option,
7979
*-x map-pb* options is automatically added to the config.yaml file and the
80-
readtag is set to None.
80+
readtag is set to None.
8181

8282
The BAM files are filtered to remove unmapped reads to keep BAM files to minimal size. However,
83-
the multiqc and statistics to be found in {sample}/bamtools_stats/ includes mapped and unmapped reads information. Each BAM file is stored in a directory named after the sample.
83+
the multiqc and statistics to be found in {sample}/bamtools_stats/ includes mapped and unmapped reads information. Each BAM file is stored in a directory named after the sample.
8484

8585

8686

8787
Rules and configuration details
8888
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
8989

9090
Here is the `latest documented configuration file <https://raw.githubusercontent.com/sequana/mapper/main/sequana_pipelines/mapper/config.yaml>`_
91-
to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file.
91+
to be used with the pipeline. Each rule used in the pipeline may have a section in the configuration file.
9292

9393

9494
Changelog
@@ -97,6 +97,8 @@ Changelog
9797
========= ======================================================================
9898
Version Description
9999
========= ======================================================================
100+
1.2.0 * Implement a bwa_split method to speed up mapping of very large
101+
fastq files.
100102
1.1.0 * BAM files are now filtered to remove unmapped reads
101103
* set wrappers branch in config file and update pipeline.
102104
* refactorise to use click and new sequana-pipetools
@@ -105,25 +107,25 @@ Version Description
105107
0.11.1 * Fix typo when setting coverage to True and allow untagged filenames
106108
0.11.0 * implement feature counts for capture-seq projects
107109
0.10.1 * remove getlogdir and getname
108-
0.10.0 * use new wrappers framework
110+
0.10.0 * use new wrappers framework
109111
0.9.0 * fix issue with logger and increments requirements
110-
* add new option --pacbio to automatically set the options for
112+
* add new option --pacbio to automatically set the options for
111113
pacbio data (-x map-pb and readtag set to None)
112114
0.8.13 * add the thread option in minimap2 case
113115
0.8.12 * factorise multiqc rule
114116
0.8.11 * Implemente the --from-project option and new framework
115117
* custom HTMrLl report
116118
0.8.10 * change samtools_depth rule and switched to bam2cov to cope with null
117-
coverage
119+
coverage
118120
0.8.9 * fix requirements
119121
0.8.8 * fix pipeline rule for bigwig + renamed output_bigwig into
120122
create_bigwig; fix the multiqc config file
121123
0.8.7 * fix config file creation (for bigwig)
122124
0.8.6 * added bowtie2 mapper + bigwig as output, make coverage optional
123125
0.8.5 * create a sym link to the HTML report. Better post cleaning.
124-
0.8.4 * Fixing multiqc (synchronized with sequana updates)
125-
0.8.3 * add sequana_coverage rule.
126-
0.8.2 * add minimap2 mapper
126+
0.8.4 * Fixing multiqc (synchronized with sequana updates)
127+
0.8.3 * add sequana_coverage rule.
128+
0.8.2 * add minimap2 mapper
127129
0.8.1 * fix bamtools stats rule to have different output name for multiqc
128130
0.8.0 **First release.**
129131
========= ======================================================================
@@ -132,7 +134,6 @@ Version Description
132134
Contribute & Code of Conduct
133135
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
134136

135-
To contribute to this project, please take a look at the
136-
`Contributing Guidelines <https://github.com/sequana/sequana/blob/main/CONTRIBUTING.rst>`_ first. Please note that this project is released with a
137+
To contribute to this project, please take a look at the
138+
`Contributing Guidelines <https://github.com/sequana/sequana/blob/main/CONTRIBUTING.rst>`_ first. Please note that this project is released with a
137139
`Code of Conduct <https://github.com/sequana/sequana/blob/main/CONDUCT.md>`_. By contributing to this project, you agree to abide by its terms.
138-

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
44

55
[tool.poetry]
66
name = "sequana-mapper"
7-
version = "1.1.0"
7+
version = "1.2.0"
88
description = "A multi-sample mapper to map reads onto a reference"
99
authors = ["Sequana Team"]
1010
license = "BSD-3"

sequana_pipelines/mapper/config.yaml

Lines changed: 26 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@
66
# If input_directory provided, use it otherwise if input_pattern provided,
77
# use it, otherwise use input_samples.
88
# ============================================================================
9-
sequana_wrappers: v23.11.18
9+
sequana_wrappers: v24.1.14
1010

1111
input_directory:
1212
input_readtag: _R[12]_
@@ -31,7 +31,8 @@ apptainers:
3131
minimap2: https://zenodo.org/record/7341710/files/sequana_tools_0.14.5.img
3232
multiqc: https://zenodo.org/record/10205070/files/multiqc_1.16.0.img
3333
samtools: https://zenodo.org/record/7341710/files/sequana_tools_0.14.5.img
34-
sequana_coverage: https://zenodo.org/record/10209929/files/sequana_0.16.1.img
34+
sequana_coverage: https://zenodo.org/record/10460105/files/sequana_0.16.5.img
35+
seqkit: https://zenodo.org/record/7821924/files/seqkit_2.4.0.img
3536
subread: https://zenodo.org/record/7341710/files/sequana_tools_0.14.5.img
3637

3738

@@ -47,6 +48,18 @@ samtools_depth:
4748
resources:
4849
mem: 8G
4950

51+
##############################################################################
52+
# BWA MEM indexing
53+
#
54+
bwa_index:
55+
options: ''
56+
threads: 4
57+
resources:
58+
mem: 8G
59+
60+
##############################################################################
61+
# BAM alignment indexing
62+
#
5063
bam_indexing:
5164
resources:
5265
mem: 8G
@@ -68,25 +81,30 @@ bwa:
6881
resources:
6982
mem: 8G
7083

71-
minimap2:
72-
options: ''
84+
bwa_split:
85+
nreads: 1000000
86+
index_algorithm: is
87+
options: -T 30 -M
7388
threads: 4
89+
tmp_directory: ./tmp
7490
resources:
7591
mem: 8G
7692

77-
bowtie2:
93+
94+
95+
minimap2:
7896
options: ''
7997
threads: 4
8098
resources:
8199
mem: 8G
82100

83-
bowtie2_index:
101+
bowtie2:
84102
options: ''
85103
threads: 4
86104
resources:
87105
mem: 8G
88106

89-
bwa_index:
107+
bowtie2_index:
90108
options: ''
91109
threads: 4
92110
resources:
@@ -158,7 +176,7 @@ sequana_coverage:
158176
## If you want your own multiqc, fill this entry
159177
multiqc:
160178
options: -p -f
161-
modules: busco quast sequana_coverage prokka
179+
modules: sequana_bamtools_stats sequana_coverage
162180
input_directory: .
163181
config_file: multiqc_config.yaml
164182
resources:

sequana_pipelines/mapper/main.py

Lines changed: 29 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -10,30 +10,31 @@
1010
# documentation: http://sequana.readthedocs.io
1111
#
1212
##############################################################################
13-
import sys
1413
import os
14+
import sys
1515

16-
import rich_click as click
1716
import click_completion
17+
import rich_click as click
1818

1919
click_completion.init()
2020

2121
NAME = "mapper"
2222

23-
from sequana_pipetools.options import *
2423
from sequana_pipetools import SequanaManager
24+
from sequana_pipetools.options import *
2525

2626
help = init_click(
2727
NAME,
2828
groups={
2929
"Pipeline Specific": [
30-
"--mapper",
31-
"--reference-file",
30+
"--aligner-choice",
3231
"--annotation-file",
32+
"--capture-annotation-file",
33+
"--create-bigwig",
3334
"--do-coverage",
35+
"--nanopore",
3436
"--pacbio",
35-
"--create-bigwig",
36-
"--capture-annotation-file",
37+
"--reference-file",
3738
],
3839
},
3940
)
@@ -45,7 +46,13 @@
4546
@include_options_from(ClickInputOptions)
4647
@include_options_from(ClickGeneralOptions)
4748
@click.option(
48-
"--mapper", default="bwa", type=click.Choice(["bwa", "minimap2", "bowtie2"]), help="Choose one of the valid mapper"
49+
"--aligner-choice",
50+
"mapper",
51+
default="bwa",
52+
type=click.Choice(["bwa", "bwa_split", "minimap2", "bowtie2"]),
53+
help="""Choose one of the valid mapper. bwa_split is experimental. it first split the fastq files in chunks of 1Mreads,
54+
aligns the reads with bwa and merge back the sub BAM files. Should be equivalent to using bwa but could be used on
55+
cluster to speed up analysis.""",
4956
)
5057
@click.option("--reference-file", required=True, help="You input reference file in fasta format")
5158
@click.option("--annotation-file", help="Used by the sequana_coverage tool if provided")
@@ -108,6 +115,20 @@ def main(**options):
108115
cfg.feature_counts.options = "-F SAF "
109116
cfg.feature_counts.gff = os.path.abspath(options.capture_annotation_file)
110117

118+
# Given the reference, let us compute its length and the index algorithm
119+
from sequana import FastA
120+
121+
f = FastA(cfg.general.reference_file)
122+
N = f.get_stats()["total_length"]
123+
124+
# seems to be a hardcoded values in bwa according to the documentation
125+
if N >= 2000000000:
126+
cfg["bwa"]["index_algorithm"] = "bwtsw"
127+
cfg["bwa_split"]["index_algorithm"] = "bwtsw"
128+
else:
129+
cfg["bwa"]["index_algorithm"] = "is"
130+
cfg["bwa_split"]["index_algorithm"] = "is"
131+
111132
# finalise the command and save it; copy the snakemake. update the config
112133
# file and save it.
113134
manager.teardown()

0 commit comments

Comments
 (0)