
-p/--threads option always uses 1 more thread than is specified. #465

Open

leowill01 opened this issue Feb 15, 2024 · 8 comments

@leowill01 commented Feb 15, 2024

When I run bowtie2 and specify the number of threads to use (e.g. -p 1, -p 4), CPU usage is always one more core than the number of threads specified.
For example, if I specify -p 1:

bowtie2 --no-unal -p 1 \
-x 'ref-genome-idx' \
-1 reads-R1_001.fastq.gz \
-2 reads-R2_001.fastq.gz

the CPU usage (via htop) for the process will be at ~200%, indicating usage of 2 threads, not 1:

 CPU% Command
  2.8 └─ /usr/bin/perl5.34 /opt/homebrew/bin/bowtie2 --no-unal -p 1 -x PAO1-- ...
196.0 	└─ /opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align-s --w ...

Likewise, for any other number of threads, it always uses n+1 CPU cores.
For -p 4:

bowtie2 --no-unal -p 4 \
-x 'ref-genome-idx' \
-1 reads-R1_001.fastq.gz \
-2 reads-R2_001.fastq.gz

CPU usage (via htop):

 CPU% Command
  2.4 └─ /usr/bin/perl5.34 /opt/homebrew/bin/bowtie2 --no-unal -p 4 -x
445.4 	└─ /opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align

(between 400-500% core usage because I have other jobs running, but on a bare system it would be ~500%, indicating usage of 5 CPU cores)

This is effectively cutting my sample throughput in half, because when bowtie2 is run with 1 thread specified, every job uses 2 threads instead.
Is this a bug or a feature? Am I missing something?

Tested on:

  • macOS Sonoma 14.2.1
  • Mac Studio (M2 Ultra)
  • bowtie2 version:
bowtie2 --version
/opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align-s version 2.5.2
64-bit
Built on Ventura-arm64.local
Sat Oct 14 18:03:18 UTC 2023
Compiler: InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Options: -O3  -funroll-loops -g3 -std=c++11 -fopenmp-simd -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
@ch4rr0 (Collaborator) commented Feb 27, 2024

Hello,

As of bowtie2 v2.5.0 we moved to an async model (a dedicated thread) for processing I/O. The --threads flag specifies the number of computational (alignment) threads that bowtie2 will use, so the total thread count will always be one more than the default or the specified value.
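
To make the thread accounting concrete, here is a minimal sketch of that layout (illustrative only, not bowtie2's actual code; read_input_batches and align_batches are hypothetical stand-ins for the real work): with -p N the process runs N alignment threads plus one dedicated reader thread, which is why htop reports roughly (N + 1) × 100% CPU when everything is busy.

#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real work; here they just sleep briefly.
void read_input_batches() { std::this_thread::sleep_for(std::chrono::milliseconds(10)); }
void align_batches()      { std::this_thread::sleep_for(std::chrono::milliseconds(10)); }

int main() {
    const int p = 4;                              // value given via -p/--threads
    std::thread reader(read_input_batches);       // the dedicated I/O thread (the "+1")
    std::vector<std::thread> aligners;
    for (int i = 0; i < p; ++i)
        aligners.emplace_back(align_batches);     // the p computational threads
    reader.join();
    for (auto& t : aligners) t.join();
    return 0;
}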

@sfiligoi (Contributor) commented Sep 11, 2024

@ch4rr0 The extra CPU core usage was introduced in commit 6f6458c, which moved from using a condition variable to pure polling.

bowtie2/pat.h, line 1344 (at commit 4cf8d52):

while (!q_.try_dequeue(ret)) ;

I benchmarked before and after the commit with -p 12: before it used 12.2 cores, afterwards 13.1.

I believe we should revert that commit.
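
To illustrate the difference, here is a generic mutex-based sketch of the two consumer strategies (not the actual pat.h code, and not lock-free like moodycamel; the queue is just a stand-in so the waiting behavior is visible): spinning on a non-blocking try_dequeue() pins one core at ~100% while no input is ready, whereas blocking on a condition variable sleeps until the producer signals.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class BatchQueue {
public:
    void enqueue(T v) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(v));
        }
        cv_.notify_one();
    }

    // Non-blocking: returns false immediately if the queue is empty.
    bool try_dequeue(T& out) {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

    // Blocking: sleeps on the condition variable until an item arrives.
    void wait_dequeue(T& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        out = std::move(q_.front());
        q_.pop();
    }

private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Polling consumer (the post-6f6458c pattern): spins while idle, ~100% CPU.
int polling_consumer(BatchQueue<int>& q) {
    int batch;
    while (!q.try_dequeue(batch)) { /* busy-wait */ }
    return batch;
}

// Blocking consumer (the pre-6f6458c pattern): sleeps while idle, ~0% CPU.
int blocking_consumer(BatchQueue<int>& q) {
    int batch;
    q.wait_dequeue(batch);
    return batch;
}

int main() {
    BatchQueue<int> q;
    std::thread producer([&q] { q.enqueue(1); q.enqueue(2); });
    int a = polling_consumer(q);
    int b = blocking_consumer(q);
    producer.join();
    return (a + b == 3) ? 0 : 1;
}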

@sfiligoi (Contributor)

For completeness, I had to revert 3 commits to avoid revert conflicts when testing:

    Revert "Use moodycamel concurrentqueue for better performance"
    
    This reverts commit 6f6458c5dae6ef8931c6024b1a20b5c2e275607d.
...
    Revert "Potential fix for LeakSanitizer failure"
    
    This reverts commit 84afe73740187a10749d236190e1c9056885d3d6.
...
    Revert "Second attempt at fixing LeakSanitizer failure"
    
    This reverts commit d4fb35a7b24a4c8d03b08d0e7672a27a8533fa34.

Not sure if there are any other side effects, but the CPU utilization definitely was drastically reduced.

@ch4rr0 (Collaborator) commented Sep 11, 2024

Hello Igor,

We put this change in place after a user reported an issue with poor thread scaling, which was due to contention on the condition variable. Why does the n+1 CPU utilization come as a surprise? As I recall, it was part of the async reads pull request. moody_camel is a concurrent, lock-free queue and should not spawn any additional threads on its own.

@sfiligoi (Contributor) commented Sep 11, 2024

The async reads were added as a way to increase performance with minimal overhead.
Using a whole CPU just to wait for input defeats the original purpose, especially on low thread count.

The number of CPU cores in a node is limited, so any CPU cycles wasted on polling are CPU time that cannot be used on actual alignment work.

@sfiligoi (Contributor)

If condition variables were really a problem at large thread counts (can you point me to the related issue, please?),
we could use moody_camel above a certain thread count (polling is much less of a problem at high thread counts)
but fall back to the more efficient condition variables at (more typical) lower thread counts.

@ch4rr0 I am willing to do the coding work, if you agree with the direction.
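
As a rough sketch of what that selection could look like (purely hypothetical: the names, the enum, and the threshold are invented for illustration, and the real cutoff would have to come from benchmarking):

#include <cstddef>

enum class DequeueStrategy { ConditionVariable, LockFreePolling };

// Placeholder cutoff; to be tuned with measurements like the ones below.
constexpr std::size_t kPollingThreshold = 16;

DequeueStrategy choose_strategy(std::size_t nthreads) {
    // Low thread counts: efficient condition-variable wait.
    // High thread counts: lock-free polling queue, where CV contention hurts.
    return nthreads >= kPollingThreshold ? DequeueStrategy::LockFreePolling
                                         : DequeueStrategy::ConditionVariable;
}

int main() {
    // e.g. -p 4 -> condition variable; -p 32 -> moodycamel-style polling.
    return choose_strategy(4) == DequeueStrategy::ConditionVariable ? 0 : 1;
}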

@ch4rr0 (Collaborator) commented Sep 12, 2024

This is the issue: #437.

I am grateful that you are willing to do the work on this. If it's not too much to ask, can you also provide some performance numbers for future reference?

@sfiligoi (Contributor) commented Sep 12, 2024

As a starting point, here are my benchmark numbers on the kind of problems relevant to my group (QIITA).

The CPU used is an AMD EPYC 7302 16-core processor.
(Given the size of the problems, we almost always use server-class setups. This node has 2 CPUs; I am using the 2nd one.)

I am using the Woltka WoLr1 DB (~40GB of memory) and the Song53_24613_host_filtered.fastq.gz input file from QIITA:
https://qiita.ucsd.edu/public/?artifact_id=101636

The command used for a single 16-core run is:

INDIR=/qmounts/qiita_data/per_sample_FASTQ/101636
INF1=Song53_24613_host_filtered.fastq.gz
time taskset -c 16-31  ./bowtie2 -p 16 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1},${INDIR}/${INF1},${INDIR}/${INF1},${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null

(aligning 4x the same file, so I can compare with aligning them in parallel below)

and for the 4x4-core runs I use

(time taskset -c 16-19 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 20-23 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 24-27 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
(time taskset -c 28-31 ./bowtie2 -p 4 -x /scratch/qp-woltka/WoLr1/WoLr1 -q ${INDIR}/${INF1} --seed 42 --very-sensitive -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min  "L,0,-0.05" --no-head --no-unal >/dev/null)&
wait

With the current master, I get:
1x16 - 740s
2x8 - 790s
4x4 - 950s
(as expected, using -p N-1 gives almost identical timings)

If I revert the moodycamel commits I get:
1x16 - 690s
2x8 - 700s
4x4 - 740s

Two takeaways:
a) It is more effective to use a single process with 16 cores than many processes each using a subset of the cores; i.e., I don't see any slowdown at larger thread counts, quite the opposite.
b) The version without moodycamel is always faster, but the difference is most notable when -p is smaller. The original async code indeed added only minor overhead, while the polling moodycamel approach slows down small -p setups drastically.
